ABSTRACT

A  recurring  problem  faced  by  many  analysts  is  that  of  devising 
estimating  procedures  for  predicting  some  aspect  of  the  future  from  rather 
meager data. This is particularly true for the cost analyst who is concerned
with estimating the resource requirements of future military systems.

Historical  Simulation  is  a  method  of  evaluating  candidate  (cost) 
estimating  procedures  on  the  basis  of  their  ability  to  simulate  predictions 
using  data  that  would  have  been  available.  For  example,  assume  that  a 
particular  data  base  consists  of  perhaps  15  data  points  ordered  in  time; 
a  typical  simulated  prediction  would  entail  using  a  candidate  estimating 
procedure  to  predict  point  10  using  only  the  information  available  in  the 
first  nine  data  points.  All  candidate  estimating  procedures  would  then  be 
evaluated  on  how  well  their  simulated  predictions  compare  with  the  actual 
data  points. 

In  this  fashion,  Historical  Simulation  avoids  relying  on  the  central 
evaluation  assumption  of  Regression  Theory,  namely,  that  which  fits  the 
past  data  best  will  predict  the  future  best.  This  conceptual  difference 
gives Historical Simulation several unique features, among which are:

1.  The  demonstration  of  an  estimating  procedure's  capability  to 
make  predictions  of  those  points  in  the  data  base  which  are 
extrapolations  from  the  previous  data 

2.  The  ability  to  directly  compare  a  wider  class  of  estimating 
procedures  than  can  be  compared  by  the  usual  regression 
techniques

3. The ability to evaluate estimating procedures derived from
stepwise  regression  independent  of  the  selection  process 
utilized  in  that  technique 

4.  The  use  of  an  easy-to-communicate  summary  statistic  for 
describing  the  accuracy  of  predictions. 

Hence,  Historical  Simulation  provides  additional  information  which, 
when  used  in  conjunction  with  the  usual  regression  techniques,  should  lead 
to  a  better  evaluation  of  candidate  estimating  procedures,  particularly 
when  the  prediction  problem  is  characterized  by  extrapolation  from  a  small 
data  base. 

The  report  is  in  two  volumes.  The  first,  which  is  unclassified, 
completely  describes  the  technique.  Included  is  a  discussion  of  reasons 
leading up to the development of Historical Simulation as well as a description
of the technique and of possible ways to summarize and interpret
the output. Volume 2, classified Confidential (Privileged Information),
illustrates the use of Historical Simulation by describing the results of
applying the technique to cost and man-hour estimating procedures for
selected aircraft programs.


The  reader  interested  largely  in  a  nontechnical  overview  may  prefer 
C.A.  Graver,  Progress  Report  On  The  Development  of  Historical  Simulation, 
General  Research  Corp  IMR-950,  March  1969,  which  was  delivered  at  the 
1969 DoD Cost Research Symposium.

I. INTRODUCTION

The  purpose  of  this  report  is  to  describe  the  progress  made  in  the 
development  of  Historical  Simulation,  a  procedure  for  the  evaluation  of 
Cost Estimating Procedures (or Cost Estimating Relationships). The work
is being sponsored by the Director of Economics and Resource Analysis,
Office of the Assistant Secretary of Defense (Systems Analysis) under
Contract Number DAHC15-68-C-0364.

As  parts  of  this  report  are  fairly  technical,  the  reader  interested 
largely  in  a  nontechnical  overview  of  the  Historical  Simulation  procedure 
is  referred  to  the  paper  delivered  at  the  recent  1969  DoD  Cost  Research 
Symposium (Ref. 1 of this volume).

A. BACKGROUND

The  current  emphasis  on  systems  analysis,  while  it  has  greatly 
enhanced  the  decision-making  capabilities  of  defense  policy  makers,  has 
placed  a  difficult  requirement  on  cost  analysts.  Working  with  functional 
cost  models  which  utilize  a  description  of  the  system  in  terms  of  its 
most  basic  physical  or  performance  characteristics,  the  analyst  is  asked 
to  make  estimates  which  often  require  extrapolations  from  extremely  meager 
data.  These  estimates  are  used  in  the  evaluation  of  which  candidate 
system is to be pursued.

Because  the  generally  sparse  nature  of  the  data  tends  to  obscure 
genuine  functional  trends,  the  analyst  must  go  to  great  pains  to  fully 
utilize  all  the  information  his  data  base  contains.  While  the  cost  analyst 
has  at  his  disposal  a  number  of  tools,  e.g.,  linear  regression  techniques, 
any additional tool that summarizes different information from the data
base, as Historical Simulation promises to do, is worthwhile.

Traditionally,  in  the  development  of  a  cost  estimating  relationship 
(CER), the cost analyst first postulates a functional relationship that
hopefully will reflect the cost generating relationship underlying the
data. The data base is then used to estimate the parameters of the functional
relationship and a CER is obtained. Whenever the functional relationship
is linear or can be transformed into a linear form, a least squares curve-fitting
technique is generally used to estimate the parameters. At this
point the analyst may examine any of a number of measures, or statistics,
based on linear regression theory to assess the goodness of the resulting
fit. If the fit is judged good, the analyst uses the CER as the basis for
cost  prediction,  concluding  that  it  represents  the  cost  generating  process 
of  the  class  of  systems  being  analyzed.  The  assumption  operating  here  is, 
in  effect,  that  which  fits  the  data  best  predicts  best. 

While  no  necessary  relationship  to  the  system's  cost  generating 
process  is  thus  established,  a  good  fitting  CER  can  be  meaningfully  used 
to  make  cost  predictions,  particularly  when  the  desired  prediction  is  an 
interpolation  within  the  framework  of  the  data  base.  But  cost  analysts 
often  deal  in  extrapolations.  New  systems  are  generally  bigger,  or  faster, 
or  newer  in  some  combination  of  physical  or  performance  characteristics, 
and so fall outside existing data. Hence, to predict the cost of a
future procurement, the cost analyst is often required to extrapolate from
the past data base.

Historical Simulation extracts information from the data base on
how well a cost estimating procedure* has performed similar extrapolations.
However, Historical Simulation cannot guarantee (any more than regression
techniques can) that an apparently valid cost estimating procedure can
accurately predict a future procurement, as the cost generating process
underlying this procurement may have drastically changed from the one
underlying similar objects already produced. What is unique about
Historical Simulation is that in evaluating the candidate cost estimating
procedure it directly uses the cost analyst's goal: predicting costs of
future objects using past data on similar objects.

* The functional form together with the technique for picking the parameters,
as distinct from a particular CER, which is the functional form together
with estimated parameters.

The premise underlying Historical Simulation lies in the observation
that, if the hypothesized functional relationship represents the cost
generating process, and if the parameter estimating technique is valid,
then the estimating procedure's validity can be demonstrated by simulating
predictions* that might have been made using it throughout the time period
of  the  data  base.  The  resulting  predictions  can  then  be  compared  with 
actual  data.  Thus  an  analyst  can  test  his  estimating  procedure  by  using 
some  of  his  data  to  simulate  a  prediction  of  a  later  data  point.  If  such 
simulated  predictions  yield  consistently  acceptable  predictions,  his 
confidence  in  the  estimating  procedure's  ability  to  predict  future 
procurements  is  greatly  bolstered,  even  if  the  future  procurement  lies 
outside  the  data  base. 
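
The simulated-prediction loop described above can be sketched in a few lines of Python. This is a minimal illustration only, not the report's computer model (Appendix I): the function names, the choice of a straight-line relationship fitted by least squares, the starting point `first_target`, the mean absolute proportional error used as a summary, and the data values are all assumptions made for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def historical_simulation(xs, ys, first_target=3):
    """Simulate predictions of points first_target..n (1-based).

    Point k is predicted from points 1..k-1 only, so each prediction uses
    just the data that would have been available at the time.
    Returns a list of (k, actual, predicted) tuples.
    """
    results = []
    for k in range(first_target, len(xs) + 1):
        a, b = fit_line(xs[:k - 1], ys[:k - 1])
        results.append((k, ys[k - 1], a + b * xs[k - 1]))
    return results


def mean_abs_proportional_error(results):
    """Illustrative summary (an assumption): mean of |predicted - actual| / |actual|."""
    return mean(abs(p - y) / abs(y) for _, y, p in results)


# Hypothetical data, ordered in time (x = weight, y = cost).
weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
costs = [36.0, 58.0, 88.0, 109.0, 137.0, 158.0, 187.0]

simulated = historical_simulation(weights, costs)
for k, actual, predicted in simulated:
    print(f"point {k}: actual {actual:.1f}, simulated prediction {predicted:.1f}")
print(f"mean absolute proportional error: {mean_abs_proportional_error(simulated):.3f}")
```

Any other estimating procedure could be substituted for `fit_line`, provided it is given only the earlier points when each prediction is simulated.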

In  contrast  to  the  that-which-fits-best-predicts-best  rationale  of 
linear  regression  theory,  the  assumption  implicit  in  the  Historical 
Simulation  approach  is  that  which  simulates  its  ability  to  predict  best 
will  continue  to  predict  best.  This  conceptual  difference  will  provide 
the  five  advantages  listed  below: 

1. The past ability of candidate cost estimating procedures to
extrapolate from historical data can be demonstrated.

2. Evaluations made using Historical Simulation constitute
additional information useful in hypothesizing new cost
estimating procedure candidates.

3. Historical Simulation can compare a wider class of cost
estimating procedures than the usually employed regression
techniques.

4. CERs derived from stepwise multiple regression techniques can
readily be tested, thus providing an independent evaluation
of them.

5. Evaluations made using Historical Simulation yield an easy-to-communicate
summary statistic that is useful in describing
the accuracy of a prediction.

* The use of the word prediction in the Historical Simulation context may
or may not have the usual meaning. If the candidate cost estimating procedure
is hypothesized independent of the data base, then the simulated
predictions are in fact predictions. But in the most typical case, when
the candidate cost estimating procedure is hypothesized after examining
the entire data base, the simulated predictions cannot be interpreted
as actual predictions, for the candidate estimating procedure undoubtedly
fits the entire sample well.

To summarize, the conceptual differences between Historical Simulation
and Regression Theory ensure that the former will give the cost analyst
new information from which he can judge the reliability and validity of
hypothesized cost estimating procedures. Hence Historical Simulation is
not a replacement for the traditional Regression Theory techniques; rather
it is another tool which the analyst can use.

B.  ORGANIZATION  OF  THE  REPORT 

This report is presented in two volumes, of which this is the first.
The second, subtitled Some Examples, presents the results of applying the
Historical Simulation technique to two aircraft samples. While the
author is not sufficiently familiar with the data to draw concrete conclusions
about which estimating procedure is best, the results are
useful in demonstrating the value of Historical Simulation. Volume II
carries  a  Confidential  classification. 

Volume  I  completely  describes  the  Historical  Simulation  technique, 
and  presents  related  background  material  about  current  estimating 
techniques.  It  is  in  five  sections,  of  which  this  Introduction  is  the  first, 
and  has  three  appendixes. 

Section  II  is  in  large  measure  devoted  to  background  material,  and 
outlines  the  considerations  and  problems  that  have  led  to  the  development 
of Historical Simulation. It concludes by listing some of the properties
that would be desired of any new evaluation procedure.

Section  III  describes  the  Historical  Simulation  procedure  in  detail, 
demonstrating  its  use  with  a  hypothesized  linear  cost  estimating  procedure, 
and utilizing a least squares fitting technique to estimate the parameter
values.  It  is  then  generalized  to  a  wider  class  of  estimating  procedures 
and  some  of  its  properties  are  discussed. 

Section IV discusses three of the ways the outputs provided by Historical
Simulation can be utilized. These three ways, or categories, are
(1) direct examination of the output, (2) data summarizations that do not
depend  on  a  particular  cost  estimating  procedure,  and  (3)  statistics  which 
utilize  the  assumptions  of  a  particular  cost  estimating  procedure. 

Section  V  concludes  the  body  of  Volume  1  with  a  discussion  of  the 
advantages  and  current  limitations  of  Historical  Simulation,  and 
identifies  some  of  the  directions  future  research  in  the  technique  might 
take.


The  three  appendixes  contain  topics  of  special  interest.  Appendix  I 
describes  a  computer  model  written  for  Historical  Simulation;  Appendix  II 
derives  the  distribution  of  the  Historical  Simulation  predictions  and 
residuals under the usual regression theory assumptions; and Appendix III
compares  several  variance  estimators. 

Before proceeding to the body of the report, it should be understood
that the word simulation, as it is used here, refers to demonstrating
a cost estimating procedure's predictive capability by simulating a
prediction that might have been made using only the data that would have
been available. Thus this procedure does not include generating the random
sample needed for Monte Carlo evaluation, an integral feature of many simulation
models.

In addition, while the present work is tailored to the cost problem,
no  limitation  is  evident  that  precludes  using  Historical  Simulation  to 
evaluate  any  estimating  procedure,  particularly  when  the  inference  to  be 
made has the characteristics of extrapolation and small sample size.

II. BACKGROUND TO THE DEVELOPMENT OF HISTORICAL SIMULATION

A.  ADVANTAGES  OF  FUNCTIONAL  COST  MODELS 

In recent years, major procurement and force decisions in the Department
of Defense have been made with the help of Systems Analysis, a
management  tool  in  which  alternative  weapon  systems  capable  of  accomplishing 
the  same  objective  are  compared  analytically.  The  alternatives  are  most 
often  described  in  terms  of  general  performance  characteristics.  Thus  a 
bomber  might  be  described  by  its  speed  (Mach  1.2,  say),  range  (1500 
nautical  miles),  and  payload  (18,000  pounds). 

Before  the  various  alternatives  can  be  compared,  estimates  of  each 
system's  cost  and  effectiveness  must  be  made.  From  these  estimates  the 
"best"  alternative  can  be  selected  or  new  alternatives  specified  and  the 
process  repeated. 

Traditionally,  cost  estimates  have  been  based  on  detailed  engineering 
evaluations  of  the  weapon  system  alternatives.  Indeed  this  process  is  still 
used,  particularly  in  industry,  when  the  comparisons  being  made  concern 
the  detailed  design  decisions  necessary  to  achieve  the  specified  weapon 
system characteristics (in the most economical fashion). For example, what
should be the shape of the wing?

However,  for  making  cost  estimates  to  be  used  in  choosing  the  major 
performance  characteristics  of  the  weapon  system  best  suited  to  a  specific 
mission, it has been found that functional cost models have several
advantages  over  the  more  traditional  engineering  approach.  By  including 
in  a  system's  functional  cost  model  all  significant  cost  generating 
performance  characteristics,  the  cost  estimate  will  depend  as  much  as 
possible  upon  the  same  variables  used  to  generate  the  effectiveness 
estimates.

In addition, functional cost models provide the rapid estimating
capability  necessary  for  making  timely  comparisons  between  alternate 
weapon  systems  having  widely  varying  performance  characteristics.  Cost 
estimates  generated  in  this  manner,  when  used  in  conjunction  with 
effectiveness estimates, become an integral part of the weapon system
performance  characteristic  specification,  rather  than  remaining  the 
result  of  a  more  detailed  evaluation  for  a  particular  weapon  system 
configuration  (which  has  been  chosen  without  regard  to  cost). 

Finally,  a  functional  cost  estimating  procedure  guarantees  a 
consistent  evaluation  of  cost.  This  is  not  usually  the  case  in  engineering 
evaluations  where  cost  definitions  and  accuracies  used  in  the  evaluation 
of  a  particular  alternative  may  differ  from  those  used  in  the  study  of 
another  alternative. 

B. CURVE FITTING AND REGRESSION TECHNIQUES IN FUNCTIONAL COST MODEL SPECIFICATION

At  one  time,  only  curve  fitting  techniques  (such  as  least  squares) 
were  used  to  develop  particular  cost  estimating  relationships  (CERs)  in 
a  functional  cost  model.  But,  by  themselves,  the  fitting  techniques 
would  not  tell  the  analyst  anything  about  the  reliability  of  cost 
estimates  made  using  a  particular  CER,  nor  would  they  help  him  choose  the 
best  from  several  competing  CERs. 

Statistical regression techniques, which essentially measure the
goodness of fit, were introduced to answer these questions. Statements
concerning predictive reliability were derived by using R-scores and
prediction intervals, while choices between CERs with different input
variables but the same functional form (e.g., linear) were made using
F- and t-tests.

There  was,  however,  a  certain  amount  of  trial  and  error  involved  in 
applying  these  regression  techniques.  Candidate  CERs  had  to  be  specified, 
and often the results of applying the regression techniques were such
that none of the CERs were acceptable. A computer routine called Stepwise
Regression (Ref. 2) has been utilized by some to eliminate a great deal of the
trial and error. The analyst has only to specify the candidate independent
variables and desirable variable transformations (e.g., square root, squared,
multiplication of two together, etc.) rather than to hypothesize the
candidate CERs. The stepwise routine can evaluate various linear
combinations of candidate variables and their transformations to derive
one* of the best linear combinations (in the sense of fitting the data
best) for a specified number of variables. The use of this program will
be discussed further in Sec. II C 3.

C.  PROPERTIES  DESIRED  IN  ANY  NEW  EVALUATION  PROCEDURE 

The  application  of  curve  fitting  and  regression  techniques  has  led 
to  several  problems,  four  of  which  are  amenable  to  evaluation  using 
Historical Simulation. The ability to deal with these problem areas is
highly desirable in any new evaluation procedure; each is discussed below
as a requirement that any new evaluation procedure should meet.


1. Needed: A Simple Measure to Define the Predictive Capability of
Candidate Cost Estimating Procedures or CERs**

A problem in applying statistical regression techniques is that the
cost analysis application is typically characterized by small sample sizes.
Hence every attempt is made to build up the sample by including all data
that are practically relevant. In so doing, however, the fulfillment of
required assumptions, such as independence of sample observations, becomes
doubtful. For this and other reasons the usual statistical interpretation
of the regression statistics (i.e., F- and t-tests, R-score) is open to
question; statements about significance levels and prediction intervals
may be meaningless.

* There has been some discussion as to whether or not the resulting linear
combination is the best. Step-forward and step-backward routines do
not always result in the same linear combination for K variables. For
further discussion see Ref. 3.

** The difference between these terms is given in the discussion of
Property 2.

UNCLASSIFIED

Even when the cost application does not satisfy the regression theory
assumptions, however, it is possible to use the regression theory machinery
to devise measures that are free from a statistical interpretation and
have a justifiable "geometric" interpretation. Such a geometrical
interpretation is described in Ref. 4, pages 13-27. This interpretation
has had little use since its presentation, probably because of its
complexity and the lack of exact rules for its application.

If  a  simpler,  heuristic  measure  can  be  defined,  one  which  will 
enable  the  analyst  to  choose  among  alternative  CERs  and  to  say  something 
about  the  reliability  of  the  estimate,  there  will  be  no  real  advantage 
in  striving  for  wide  understanding  of  this  geometrical  interpretation. 

Such a measure, called Average Proportional Error, is identified and
discussed in Sec. IV B of this report.

2. Needed: An Evaluation Procedure That Can Directly Compare a
Broader Class of Candidate CERs (Called Cost Estimating Procedures)

Under (1) above the question of the meaning of the usual statistics
in the cost analysis application was addressed. Here attention is focused
on the comparability of these statistics. How does one choose between
models of different functional form, e.g., $Y = a + bX$ and $Y = aX^b$?

One approach is to use the index of determination (or $R^2$). But
the values of this statistic cannot be directly compared, and a model
choice based on the index value closest to one can be very misleading.


For  a  discussion  of  the  regression  theory  assumptions  and  the  question 
of  whether  they  are  satisfied  in  the  cost  analysis  application,  see 
Ref. 4, pages 3-8.

To illustrate this point, examine the contents of Table 1. This
is the result of running a library computer program which fits six
different curve forms (second column) in an attempt to choose the best.
As can be seen by evaluating the indexes of determination (or $R^2$'s, the
column marked Index), curve six appears to be the best choice. A printout
of the table of residuals quickly dispels this notion, however: the
fit in terms of Y is lousy indeed.

The problem is that the indexes of determination are not comparable.
This is because the index is calculated on a least squares fit, but the
fit is not applied until the candidate curve has been transformed into
a linear form. For example, the linear form of curve 6 (Table 1) is

$$\frac{1}{Y} = A + \frac{B}{X} \qquad (1)$$

The fit criterion then is

$$\sum_i \left( \frac{1}{Y_i} - A - \frac{B}{X_i} \right)^2$$

and A and B are picked to minimize this quantity. The index of
determination is calculated on the linear fit and hence applies to $1/Y_i$,
and not to the quantity of interest, $Y_i$. Hence, they should not be
compared.*

* In fact, this particular example is a bad fitting technique. Examples
of data that fit $Y = X/(AX + B)$ well, but do not fit Eq. 1 well, can
be easily constructed.
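
The point about non-comparable indexes can be checked numerically. The sketch below fits Y = X/(AX + B) through its linear form, 1/Y = A + B(1/X), and then computes the index of determination twice: once on the 1/Y scale where the least squares fit was made, and once on the Y scale that the analyst actually cares about. The function names and the sample values are hypothetical and are not taken from Table 1; on saturating data of this kind the two index values may differ sharply.

```python
from statistics import mean


def least_squares(us, vs):
    """Least squares fit of v = a + b*u; returns (a, b)."""
    u_bar, v_bar = mean(us), mean(vs)
    b = (sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
         / sum((u - u_bar) ** 2 for u in us))
    return v_bar - b * u_bar, b


def index_of_determination(actual, fitted):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    a_bar = mean(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - a_bar) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def fit_curve_six(xs, ys):
    """Fit Y = X/(A*X + B) by least squares on the linear form 1/Y = A + B/X."""
    inv_x = [1.0 / x for x in xs]
    inv_y = [1.0 / y for y in ys]
    a, b = least_squares(inv_x, inv_y)
    fitted_inv_y = [a + b * u for u in inv_x]
    fitted_y = [x / (a * x + b) for x in xs]
    return {
        "A": a,
        "B": b,
        "index_on_inverse_scale": index_of_determination(inv_y, fitted_inv_y),
        "index_on_y_scale": index_of_determination(ys, fitted_y),
    }


# Hypothetical data that levels off at large X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [5.0, 12.0, 22.0, 40.0, 75.0, 110.0, 150.0, 160.0]

result = fit_curve_six(xs, ys)
print("index computed on 1/Y:", result["index_on_inverse_scale"])
print("index computed on Y:  ", result["index_on_y_scale"])
```

Only the second figure is measured in the same units as the index of a curve fitted directly in Y, so only it can be set beside such an index.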

The author is not asserting that valid comparisons for different
functional forms cannot be made. For instance, in Ref. 5 valid comparisons
are made for a linear model and an exponential model, i.e.,

$$Y = a X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n} \qquad (2)$$

But these comparisons are based on either making the statistics comparable
or making the parameter selection technique the same. In the example of
Table 1, however, care is not taken to make the index of determination
comparable even though the parameter selection techniques are different;
the curve-fitting technique is first a transformation of the equation,
i.e., $Y = X/(AX + B)$ becomes Eq. 1, and then a least squares curve fit.

TABLE 1

MODEL COMPARISONS

(? = digit illegible in the source printout)

XMEAN: 7.5    YMEAN: 114.379

NUMBER  CURVE          INDEX     A             B
1       Y=A+B*X        .870942   1.47802       15.0534
2       Y=A*EXP(B*X)   .7?4??9   12.9491       .238688
3       Y=A*X^B        .94189?   4.9841        1.46224
4       Y=A+(B/X)      .644278   164.308       -214.977
5       Y=1/(A+B*X)    .44971    .093564       -8.48198E-3
6       Y=X/(A*X+B)    .9824??   -1.79869E-2   .206394

FOR WHICH CURVE ARE DETAILS DESIRED (NUMBER)? 6

COEFFICIENTS:  EXPECTED VALUE   95PCT CONFIDENCE LIMITS
A:             -1.79869E-2      -2.38852E-2   -1.20886E-2
B:             .206394          .188814       .223974

X-ACTUAL  Y-ACTUAL  Y-ESTIM    95PCT CONFIDENCE LIMITS
1         5.2       5.30765    4.93682    5.73871
2         1?        11.7357    10.9222    12.6801
3         2?.2      19.6807    18.0428    21.6457
4         ??.1      29.7516    26.399?    34.079
5         76.?      42.9333    36.249?    52.638
6         ?         60.9304    48.0256    83.3188
7         141.1     86.9714    62.3619    143.668
8         1?9.2     128.002    80.205?    316.774
9         1?4.6     202.191    103.034    5372.98
10        1?7.8     376.996    133.284    -455.022
11        169       1288.27    175.282    -240.812
12        170.4     -1270.07   237.532    -172.871
13        173.8     -473.849   339.34?    -139.516
14        176.1     -308.221   536.01     -119.696

To further clarify this distinction it is helpful to make explicit
the often-neglected difference between a CER and what I have called the
Cost Estimating Procedure.

A cost estimating procedure consists of a parametric estimating
relationship (PER) PLUS a technique for estimating the values of the
parameters (in the PER) from some sample. Thus an example of an estimating
procedure might be:

Parametric Estimating Relationship: Y = a + bX
(Y is the production cost of the item to be estimated;
X is the weight of the item to be estimated)

Technique: Least squares curve fit

A new estimating procedure results from choosing a new PER, a new
technique, or both. Hence the combinations given in Table 2 are all
examples of alternative estimating procedures.

When a cost estimating procedure, with PER Y = a + bX, say, is used in
conjunction with a particular sample (i.e., a particular set of observations),
there is derived an explicit cost estimating relationship (CER),
for example, Y = 10 + 25X. This is a result of estimating the PER
parameters by applying the estimating technique to the given sample.

Thus every CER has identified with it a particular
sample and an estimating procedure consisting of
a PER and a technique.

The  relationship  of  these  entities  is  pictured  in  Fig.  1. 
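
The distinction between a cost estimating procedure (PER plus technique) and a CER (the procedure applied to a particular sample) maps naturally onto a small data structure. The following Python sketch is an illustration only: the class and function names are hypothetical, the sample values are invented, and the second technique is one reading of the "closest two data points" idea, which needs to know the point being predicted and so takes a `target_x` argument.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Params = Tuple[float, float]


def linear_per(params: Params, x: float) -> float:
    """PER of the form Y = a + bX."""
    a, b = params
    return a + b * x


def least_squares(xs, ys, target_x) -> Params:
    """Technique: least squares line through all sample points (target_x unused)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


def two_closest_points(xs, ys, target_x) -> Params:
    """Technique: line through the two sample points whose X is nearest target_x.

    Assumes the two nearest points have distinct X values.
    """
    (x1, y1), (x2, y2) = sorted(zip(xs, ys), key=lambda p: abs(p[0] - target_x))[:2]
    b = (y2 - y1) / (x2 - x1)
    return y1 - b * x1, b


@dataclass(frozen=True)
class EstimatingProcedure:
    name: str
    per: Callable[[Params, float], float]
    technique: Callable[[Sequence[float], Sequence[float], float], Params]

    def derive_cer(self, xs, ys, target_x):
        """Apply the technique to a sample; the result is an explicit CER."""
        params = self.technique(xs, ys, target_x)
        return params, (lambda x: self.per(params, x))


procedures = [
    EstimatingProcedure("linear PER, least squares", linear_per, least_squares),
    EstimatingProcedure("linear PER, two closest points", linear_per, two_closest_points),
]

# Hypothetical sample (x = weight, y = production cost); estimate the cost at x = 5.
sample_x = [1.0, 2.0, 3.0, 4.0, 6.0]
sample_y = [36.0, 58.0, 88.0, 109.0, 158.0]
for proc in procedures:
    (a, b), cer = proc.derive_cer(sample_x, sample_y, 5.0)
    print(f"{proc.name}: CER is Y = {a:.2f} + {b:.2f}X, estimate at X=5 is {cer(5.0):.2f}")
```

The same PER appears in both procedures; only the technique differs, and each yields its own CER from the same sample.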

TABLE 2

COST ESTIMATING PROCEDURES

Procedure
Number     PER           Technique
1          Y = a + bV    Least squares fit
2*         Y = a + bV    Line determined by the closest two data points
                         in terms of V
3*         Y = a + bX    Same as above except closest measured in terms
                         of X
4          Y = aX^b      Least squares fit on log Y = log a + b log X

Y = production cost, X = weight, V = volume

* Procedures 2 and 3 in Table 2 may need some explanation. The technique
proposed is very close to costing by analogy. In effect, the analyst
assumes that if he forms a line with the two closest data points (in
terms of his independent variable) to the point he wishes to predict,
the estimate using this line will be better than an estimate made using
a line that fits all the data.

Figure 1 (U). Relationship of CER and Cost Estimating Procedure

The usual regression theory statistics are comparable if the
technique is the same for all candidate estimating procedures. In
particular, this is true if the candidates have the same PER form, as the
same technique can easily be used. For example, one can compare a linear
PER which has two independent variables with one which has three independent
variables.

If  the  PER  forms  are  different,  however,  it  is  not  always  easy  to 
choose  the  same  technique.  Applying  least  squares  directly  to  a  PER  form 
such as Eq. 2 requires the use of expansions and iterative computer
solutions.


What  is  needed,  then,  is  an  evaluation  procedure  which  can  compare 
any  cost  estimating  procedures  without  regard  to  whether  or  not  the 
techniques are the same. As Sec. III D points out, Historical Simulation
is  such  an  evaluation  procedure. 


3. Needed: A Means of Evaluating Estimating Procedures Derived With
Help of Stepwise Regression

Prior  to  the  introduction  of  the  stepwise  regression  technique, 
candidate  CERs  had  to  be  hypothesized,  with  the  hypotheses  presumably 
based  on  engineering  rationales  or  other  criteria.  The  need  for  this 
specification  was  operationally  removed  when  the  stepwise  multiple 
regression  routine  became  available.  Only  the  candidate  variables  and  their 
allowable  transformations  had  to  be  specified.  However,  when  the  stepwise 
routine  was  applied  the  resulting  CER,  while  fitting  the  data  well,  often 
had  no  physical  rationale.  The  applicability  of  the  result  then  became 
questionable,  even  with  a  good  fit.  For  example,  suppose  a  hundred 
different  CER  combinations  are  tried.  It  is  not  surprising  that  one  or 
two  will  fit  well  enough  to  be  judged  significant  at  the  0.05  significance 
level.  This  follows  from  the  fact  that  the  CER  hypothesis  is  not  picked 
a  priori  but  is  the  result  of  finding  the  one  that  fits  the  data  best  from 
a  hundred  linear  combinations;  as  such,  this  fit  could  easily  represent 
one  of  the  five  times  out  of  100  that  such  a  fit  theoretically  occurs  by 
chance (at the 0.05 significance level).

With such misgivings concerning the results of stepwise regression,
it would be valuable to have an evaluation procedure which could check
estimating procedures derived by this technique. It will be shown in
Sec. III D that Historical Simulation can make this independent evaluation.
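
The arithmetic behind the "five times out of 100" remark can be sketched as follows. This is only an illustration under a simplifying assumption that is not in the report: the hundred candidate fits are treated as independent significance tests applied to data with no real relationship.

```python
# Assumption (not from the report): the candidate CERs behave like
# independent tests at the 0.05 level, with no true relationship present.
alpha = 0.05          # significance level quoted in the text
n_candidates = 100    # number of CER combinations tried

# Expected number of candidates that look "significant" purely by chance.
expected_false_positives = alpha * n_candidates

# Probability that at least one candidate looks "significant" by chance.
prob_at_least_one = 1.0 - (1.0 - alpha) ** n_candidates

print(f"expected chance fits: {expected_false_positives:.1f}")
print(f"P(at least one chance fit): {prob_at_least_one:.3f}")
```

Under that assumption a good-looking fit among a hundred candidates is close to certain, which is why the fit of the selected CER says little by itself.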

4.  Needed: An Evaluation Procedure Free From the That-Which-Fits-Best-Predicts-Best Curve-Fitting Assumption

The discussion above of the third desired property throws into
doubt one of the central assumptions of least squares curve fitting, namely
that which fits the past best will predict the future best, for it is this
criterion that the stepwise regression procedure uses to choose CERs.

A second peculiarity of the cost analysis problem, in addition to
small sample sizes, casts further doubt on the applicability of this
least squares curve-fitting assumption. While using the criterion of
that which fits best, predicts best should work reasonably well for cost
predictions that are interpolations on the characteristics present in the
data base, the criterion yields little information concerning cost
predictions of procurements which represent extrapolations from the
characteristics in the data base (see Ref. 6, page 6).

Predicting the cost of procurements that represent extrapolations
from the data base is precisely the problem that the cost analyst usually
faces. It seems that we are always required to estimate the cost of a
bigger or faster plane, or one that is more advanced in some combination of
characteristics than those procured in the past.

Hence, a fourth desirable property for a new evaluation procedure
is that it be independent of the assumption of that which fits best
predicts best. In addition, it will be desirable that the evaluation procedure
depends on how well the candidate cost estimating procedure can
extrapolate from historical data. As will be seen in Sec. III D,
Historical Simulation is such an evaluation procedure.

III.  HISTORICAL SIMULATION DESCRIPTION

A.  BASIC  CONCEPT 

The  job  of  a  cost  analyst  is  to  try  to  predict  the  cost  (in  constant 
dollars)  of  a  proposed  future  procurement.  He  has  at  his  disposal  a 
description  of  the  procurement  in  the  form  of  a  set  of  physical  and 
performance  characteristics.  In  addition,  he  has  available  physical  and 
performance characteristics as well as cost data on similar past procurements.*
Hence, his primary objective is the prediction of the cost of a future
procurement using available historical data.

Historical Simulation uses this primary objective in measuring the
value of a cost estimating procedure. This basic tenet can be stated as
follows:

    The cost estimating procedure which can best
    simulate predictions that would have been made
    in the past will actually be best able to
    predict the future.

B.  AN EXAMPLE

To  evaluate  different  cost  estimating  procedures,  using  the  tenet 
just  stated,  Historical  Simulation  calls  for  each  candidate  cost 
estimating  procedure  to  be  tested  on  subsamples  of  the  actual  data  base. 
For  each  subsample,  the  candidate  cost  estimating  procedure  is  used  to 
predict  the  cost  of  procurements  built  after  any  of  the  procurements  in 
the  subsample.  These  predictions  are  then  compared  to  the  actual  costs. 


* It will be assumed that the cost data is in constant dollars and pertains
to some production quantity, like the hundredth unit.

To demonstrate this process, consider the following example comprising
the thirteen data points listed in Table 3.* The data has been ordered
by date of first delivery (second column), and the actual cost and the
independent variables $X_1$ and $X_2$ have been collected for each data
point; $X_1$ and $X_2$ are physical or performance characteristics (such
as weight and speed) which we hope will be useful in specifying the cost
of the procurements we are to estimate. We have hypothesized the following
cost estimating procedure:

    $\text{Cost} = a + b_1 X_1 + b_2 X_2$    (3)

where $a$, $b_1$, and $b_2$ are to be estimated through the process of a
least squares curve fit.


TABLE 3
SAMPLE DATA

Procurement    First       Actual
Number         Delivery    Unit Cost        X1       X2

     1           1950          95          1,996     153
     2           1951          31            967     144
     3           1953          60          2,414     149
     4           1954          82          4,418     144
     5           1956          25            852     107
     6           1958          67          2,072     136
     7           1960         243         10,408     177
     8           1961          54          2,643     160
     9           1962         112          3,786     172
    10           1963         106          3,335     203
    11           1964         183          6,374     196
    12           1965         156          7,092     187
    13           1967         177         10,304     167

* This data was used to debug the Historical Simulation computer program
described in Appendix I. Values for Tables 3 through 8 were obtained
from the output of this program as reproduced in Table 20 of Appendix I.
The data used does not represent any real-world sample but is used only
to illustrate the Historical Simulation procedure.

Now suppose we start with a subsample of five items; that is, we will
treat the first five rows of Table 3 as our data base. This is the data
base from which an analyst would have had to make cost predictions in
1957. Using a least squares fit, the derived CER is

    $\text{Cost} = -73.9 + 0.0104 X_1 + 0.792 X_2$    (4)

From Table 3, $X_1$ and $X_2$ for procurement number 6 are 2,072 and
136. If these values are substituted into the CER of Eq. 4, the predicted
cost is 55.3. From Table 3 the actual cost was 67; thus we have underestimated
by 11.7.
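
The fit and the single prediction just described can be reproduced with a short sketch. It is not the Appendix I program; the helper names `solve` and `least_squares` are illustrative, and the fit is done through the normal equations in plain Python as one possible way of obtaining a least squares solution.

```python
# First five rows of Table 3 as (actual unit cost, X1, X2).
FIRST_FIVE = [
    (95, 1996, 153), (31, 967, 144), (60, 2414, 149),
    (82, 4418, 144), (25, 852, 107),
]

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting (illustrative helper)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(rows):
    """Fit Cost = a + b1*X1 + b2*X2 via the normal equations."""
    design = [[1.0, x1, x2] for _, x1, x2 in rows]
    costs = [cost for cost, _, _ in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, costs)) for i in range(3)]
    return solve(xtx, xty)

a, b1, b2 = least_squares(FIRST_FIVE)

# Procurement 6 from Table 3: X1 = 2,072, X2 = 136, actual cost 67.
predicted = a + b1 * 2072 + b2 * 136
residual = predicted - 67   # negative means an underestimate

print(f"a = {a:.1f}, b1 = {b1:.4f}, b2 = {b2:.3f}")
print(f"predicted = {predicted:.1f}, residual = {residual:.1f}")
```

The coefficients should agree with Eq. 4 to the precision printed there; small differences in the last digit of the prediction can arise from rounding.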

Next, Eq. 4 can be used to predict the remaining data points 7-13.
These predictions can be compared to the actual costs, and residuals
calculated, yielding the results given in Table 4. As one can see, there
were six underestimates and two overestimates.

The entire process described thus far is now repeated for a subsample
size of six. That is, we add the sixth procurement to our subsample,
taking the six top rows of Table 3 as our data base. This data base is
the one from which a cost analyst would have made his cost prediction in
1959. Making a least squares fit to this data base we obtain the
following CER:

    $\text{Cost} = -68.4 + 0.0105 X_1 + 0.765 X_2$    (5)

Comparing Eqs. (5) and (4) we see that the parameters have changed,
although not by any great amount. This change is, of course, the result
of adding procurement number 6 to the sample. The point to be remembered
is that the explicit CER has changed, but the CER form, i.e.,
$\text{Cost} = a + b_1 X_1 + b_2 X_2$, and the parameter estimating technique, namely,
least squares, have not changed. It is the CER form and the parameter
estimating technique that are being evaluated by Historical Simulation,
and not any one explicit CER such as Eq. 5.


TABLE 4

PREDICTED COSTS USING FIRST FIVE PROCUREMENTS

Procurement    Actual       Predicted
Number         Unit Cost    Cost         Residual*

     6             67         55.3        -11.7
     7            243        174.4        -68.6
     8             54         80.2         26.3
     9            112        101.6        -10.4
    10            106        121.5         15.5
    11            183        147.5        -35.5
    12            156        147.9         -8.1
    13            177        165.4        -11.6

* Negative numbers are underestimates; positive numbers are overestimates.


Predictions and residual calculations for procurements 7-13 can
now be made using Eq. 5, yielding the results shown in Table 5. Notice
that procurement number 6 is not included since it was part of the data
base used to derive Eq. 5.


TABLE 5

PREDICTED COSTS USING FIRST SIX PROCUREMENTS

Procurement    Actual       Predicted
Number         Unit Cost    Cost         Residual

     7            243        176.1        -66.9
     8             54         81.7         27.7
     9            112        102.8         -9.2
    10            106        121.8         15.8
    11            183        148.3        -34.7
    12            156        149.0         -7.0
    13            177        167.4         -9.6

The procedure described thus far can be repeated using subsample
data base sizes of 7, 8, and on up to 13. In the last case the entire
sample is used and the usual least squares fit is obtained. Of course,
no predictions for which an actual cost exists in the data base can be
made using this final CER. However, this is the CER which will be used
to make future predictions if the PER and parameter estimating technique
being evaluated by Historical Simulation are chosen as a good method for
predicting cost.

The  outputs  described  can  be  conveniently  summarized  in  a  table  of 
predictions  (Table  6),  a  table  of  residuals  (Table  7),  and  a  table  of 
parameter  estimates  (Table  8).  The  interpretation  of  this  output  will 
be  discussed  in  Sec.  IV. 
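
The whole example can be sketched as one loop over subsample sizes. The code below is an illustration and not the program of Appendix I: the function names are hypothetical, the least squares fit uses the normal equations in plain Python, and the data are the Table 3 values as (cost, X1, X2).

```python
# Table 3 rows as (actual unit cost, X1, X2), ordered by first delivery.
TABLE_3 = [
    (95, 1996, 153), (31, 967, 144), (60, 2414, 149), (82, 4418, 144),
    (25, 852, 107), (67, 2072, 136), (243, 10408, 177), (54, 2643, 160),
    (112, 3786, 172), (106, 3335, 203), (183, 6374, 196), (156, 7092, 187),
    (177, 10304, 167),
]

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting (illustrative helper)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(rows):
    """Fit Cost = a + b1*X1 + b2*X2 via the normal equations."""
    design = [[1.0, x1, x2] for _, x1, x2 in rows]
    costs = [cost for cost, _, _ in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, costs)) for i in range(3)]
    return solve(xtx, xty)

def simulate(rows, n0):
    """Return parameter estimates, predictions and residuals keyed by
    subsample size n and (n, predicted point)."""
    total = len(rows)
    params, predictions, residuals = {}, {}, {}
    for n in range(n0, total + 1):
        a, b1, b2 = least_squares(rows[:n])      # CER from the first n points
        params[n] = (a, b1, b2)
        for point in range(n + 1, total + 1):    # only later procurements
            cost, x1, x2 = rows[point - 1]
            estimate = a + b1 * x1 + b2 * x2
            predictions[(n, point)] = estimate
            residuals[(n, point)] = estimate - cost
    return params, predictions, residuals

params, predictions, residuals = simulate(TABLE_3, n0=5)
for point in range(6, 14):
    print(point, round(predictions[(5, point)], 1),
          round(residuals[(5, point)], 1))
```

The three dictionaries correspond in layout to Tables 8, 6 and 7 respectively; the entry for n = 13 appears only in `params`, since no later point remains to be predicted.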


A word of caution must be inserted at this point. The results of
this particular example as displayed in Tables 6, 7, and 8 are merely
illustrative. Their purpose is simply to make explicit the Historical
Simulation procedure and the form of the output. Results of a limited
number of Historical Simulation runs (using the computer program described
in Appendix I) are presented in Volume 2 (CONFIDENTIAL) for some aircraft
data. They were excluded from the present volume to avoid the necessity
of classifying it.


Some  of  the  possible  ways  of  analyzing  these  results  are  discussed 
in  Sec.  IV,  but  it  must  be  remembered  that  Historical  Simulation  is 
intended  primarily  as  a  tool  for  evaluating  an  estimating  procedure. 
Hopefully,  the  evaluation  will  be  made  in  the  presence  of  other  candidates. 
Only the analyst who understands his data base can make such judgements
as to whether

•  The results are reasonable, and the estimating procedure is
   valid, or

•  The results are not reasonable and a new estimating procedure
   should be hypothesized, and/or the sample should be stratified,
   i.e., divided into groups which seem to come from different
   populations.


TABLE 6

PREDICTIONS

                          For Sample Point Number
Sample
Size Used      6       7       8       9      10      11      12      13

    5        55.3   174.4    80.2   101.6   121.5   147.5   147.9   165.4
    6               176.1    81.7   102.8   121.8   148.3   149.0   167.4
    7                        85.4   114.3   128.1   177.4   183.9   227.0
    8                               102.1   103.9   161.5   172.6   229.3
    9                                       110.7   166.3   176.1   229.2
   10                                               164.5   174.9   229.8
   11                                                       179.7   223.7
   12                                                               227.1
   13

TABLE  7 
RESIDUALS


TABLE 8

ESTIMATED PARAMETERS

Sample Size        a          b1         b2

     5           -73.9      0.0104     0.792
     6           -68.4      0.0105     0.765
     7           -74.6      0.0178     0.706
     8           -31.8      0.0198     0.344
     9           -45.2      0.0193     0.450
    10           -38.6      0.0196     0.400
    11           -50.2      0.0198     0.478
    12           -44.6      0.0191     0.448
    13           -63.9      0.0159     0.629


C.  SUMMARIZATION OF THE PROCEDURE

This summarization, or generalization, of Historical Simulation
is presented in the language of the estimating procedures introduced in
Sec. II in order to make it apparent that Historical Simulation can be
used on any estimating procedure. (This was the second desirable
property stated in Sec. II.)

Let the estimating procedure being examined have a PER given by

    $y = f(\vec{\beta}, \vec{X})$    (6)

where $y$ is the cost,
      $\vec{\beta}$ are the parameters of the function, and
      $\vec{X}$ are the independent variables.

In the example given in Sec. III B, $\vec{\beta}$ represents the parameters $a$, $b_1$,
and $b_2$; $\vec{X}$ the independent variables $X_1$ and $X_2$; and $f$ the linear
equation given by Eq. 3. To complete the estimating procedure specification,
there is a technique $T$ which, when applied to a sample, yields an
estimate of the parameters $\vec{\beta}$. In the example of Sec. III B the technique
$T$ was least squares curve fit.

The sample consists of $N$ sets of data $(y_i, \vec{X}_i)$,
$i = 1, 2, \ldots, N$, where the $y_i$ are the actual cost of procurement
$i$, and the $\vec{X}_i$ are the values of the independent variables for
procurement $i$. It is assumed that the sample has been ordered in time, with
the smaller values of $i$ corresponding to the older data points.

The Historical Simulation procedure can then be summarized as an
iterative process which goes through the following four steps at each
iteration.

Step 1.  Subsample Specification: Determine data base size $n$ for
this iteration, where $n$ is larger than the subsample size of the
previous iteration. In particular, $n_0 \le n \le N$, where $n_0$ is some
minimum sample size which is greater than the number of PER
parameters, i.e., entries in $\vec{\beta}$. In the case of the example,
$n_0 \ge 4$ as there are three parameters to estimate: $a$, $b_1$, and
$b_2$.

Step 2.  CER Specification: Apply the estimating procedure technique
$T$ to the subsample of size $n$ identified in Step 1, i.e.,
$(y_i, \vec{X}_i)$; $i = 1, 2, \ldots, n$, and obtain the PER parameter
estimates $\hat{\vec{\beta}}_n$. (In the example of the last section, least squares
estimates of $a$, $b_1$, and $b_2$ were made for each iteration.)
Substituting these parameter estimates into the PER yields the CER
for this iteration. It can be denoted by
$y = f(\hat{\vec{\beta}}_n, \vec{X})$.

Step 3.  Cost Prediction: Predict the cost of each of the procurements
not included in the subsample. This is accomplished by
substituting the values of the independent variables (for the
procurement in question) into the CER developed in Step 2. Predictions
are made of $y_{n+k}$, $k = 1, 2, \ldots, N-n$. These predictions are
labeled $\hat{y}^{(n)}_{n+k}$ in the remainder of this report and are given by

    $\hat{y}^{(n)}_{n+k} = f(\hat{\vec{\beta}}_n, \vec{X}_{n+k})$,    $k = 1, 2, \ldots, N-n$    (7)

where $\hat{y}^{(n)}_{n+k}$ is the prediction of $y_{n+k}$ from subsample size $n$ and
the $\vec{X}_{n+k}$ are the values of the independent variables for the
$(n+k)$th procurement. (For the example these predictions were listed
in Table 6.)

Step 4.  Calculation of the Residuals: The actual costs are subtracted
from the appropriate predictions (Step 3) and the residuals
obtained. These residuals, denoted by $d^{(n)}_{n+k}$, are given by

    $d^{(n)}_{n+k} = \hat{y}^{(n)}_{n+k} - y_{n+k}$    (8)

Negative values of $d^{(n)}_{n+k}$ represent underestimates while positive
values are overestimates. (The residuals for the example of the
last section were given in Table 7.)

A few remarks should be made concerning Step 1, Subsample Specification.
For the purposes of Historical Simulation, several data points
procured in the same time frame can be grouped together.* For instance,
if data points (procurements) 7, 8, and 9 were all delivered in the same
year, one can group this data. Iterations of the Historical Simulation
would include subsamples 5, 6, 9, 10, 11, 12, and 13. Predictions of
data points 7, 8, and 9 would only be made with subsamples of five and
six data points. Information concerning data points 7 and 8 would not have
been available for the prediction of data point 9, so grouping the data
does not invalidate the Historical Simulation procedure.

* Grouping will have no effect on the Historical Simulation evaluation with
the exception of those statistics discussed in Sec. IV C 2 which, at
present, are valid only for the one-step residuals $d^{(n)}_{n+1}$.
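
The effect of grouping on the list of subsample sizes can be sketched as below. The function name and the delivery years are hypothetical; the years are chosen so that procurements 7, 8 and 9 share a year, as in the instance just described.

```python
def subsample_sizes(delivery_years, n0):
    """Return the subsample sizes n >= n0 that end on a complete group,
    i.e. where procurement n + 1 was delivered in a later year than
    procurement n (or n is the whole sample)."""
    total = len(delivery_years)
    sizes = []
    for n in range(n0, total + 1):
        is_whole_sample = n == total
        # delivery_years is 0-based: index n is procurement n + 1.
        if is_whole_sample or delivery_years[n] != delivery_years[n - 1]:
            sizes.append(n)
    return sizes

# Hypothetical delivery years with procurements 7, 8 and 9 in the same year.
years = [1950, 1951, 1953, 1954, 1956, 1958,
         1960, 1960, 1960, 1963, 1964, 1965, 1967]
sizes = subsample_sizes(years, n0=5)
print(sizes)
```

Sizes 7 and 8 are skipped because a subsample ending there would contain part of a group whose remaining members it is then asked to predict.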

Another problem in subsample specification is selecting the initial
subsample. A lower bound exists that is dictated by the number of
parameters to be estimated. For the example in this section the lower
bound would be four (one greater than the number of parameters as required
for a finite variance least squares fit). But this selection of four
subsample items would allow only one degree of freedom and one would expect
a great deal of variation in the predictions. Using too large an initial
sample, however, will greatly reduce the amount of new information
contained in Tables 6, 7, and 8. The initial subsample size must thus
be set by the analyst at the smallest number which is necessary for the
estimating procedure, if valid, to have enough information from which to
make reasonable estimates. (In the example $n_0$ was arbitrarily chosen
to be 5.)

D.  SOME PROPERTIES

Several properties of the Historical Simulation procedure can be
established from the development made thus far. For instance, the
procedure evaluates a candidate cost estimating procedure by simulating
how well the latter would have predicted if it had been available and
used to make cost estimates in the past. Hence the name Historical
Simulation.

Historical Simulation does not depend on the usual curve fitting
assumption that which fits best, predicts best. (This was
identified as desirable property number 4 for a new evaluation procedure
in Sec. II C.) The freedom from the curve fitting assumption is a
consequence of the fact that the output in Tables 6 and 7 depends only
on how well the hypothesized cost estimating procedure predicts. The
entries do not depend on how well a particular CER fits the subsample that
was given to it.


Historical Simulation can be used to evaluate CERs derived with the
help of Stepwise Multiple Regression programs, whose use was discussed
briefly in Sec. II. Using this program, the choice of a CER is determined
by which candidate CER fits the data best (in a least squares sense).
Unfortunately, the values of the usual regression statistics depend on this
choice criterion and are thus not independent of the CER selection process.

In contrast, Historical Simulation does not depend on the choice criterion
as its output does not depend on how well the CER fits. In other words,
Historical Simulation, unlike the usual regression statistics, is able
to evaluate the CER independently of the stepwise regression choice criterion.
(This property was identified as desirable property number 3 for a new
evaluation procedure in Sec. II C.)

Due to its dependence on predicting from past data, Historical
Simulation is a tool to demonstrate the estimating procedure's ability to
handle extrapolations implicit in the data base. This is in contrast to
the estimating procedure's ability to interpolate, which can be evaluated
by the usual regression theory approach. The extrapolation is in the time
direction as the data is ordered on time. Indeed this is probably the
most universal ordering as it will tend to parallel orderings on physical
characteristics. This is because new procurements usually represent
advancements in the state of the art, as measured by some set of physical
characteristics. Hence ordering on time will also tend to order on these
physical characteristics.*

* It should be noted that there may be applications in which the advancement
implicit in a new procurement is represented by an increase in one physical
characteristic, say bandwidth. The problem then would be to estimate the
cost of this new procurement from a data base of procurements which all
have smaller bandwidths. The extrapolation then would be in the bandwidth
direction and, in this case, the author sees no reason why the ordering
could not be on bandwidths.

Another difference between Historical Simulation and the usual curve
fitting techniques is that the former looks at different samples while
the latter concentrates on the entire historical sample. In effect the
Historical Simulation procedure looks at how well the hypothesized CER
form does at varying times and hence how reliable the hypothesized CER
is over time. In contrast, the curve fitting techniques and the associated
regression statistics evaluate one period in time, the present, and will
in general be unable to detect time-trend effects.**

Finally, Historical Simulation can be used to directly compare any
candidate cost estimating procedures. (This was identified in Sec. II C as
desirable property number 2 for a new evaluation procedure.) This is quite
apparent from the fact that the summarization of the procedure in the
last section was carried out in estimating procedure notation. All that
is needed is a PER, Eq. 6, and a parameter estimating technique T.

Having defined the Historical Simulation procedure and some of its
properties and seen how it works for a particular example, attention must
now be focused on the output of Historical Simulation. What is it good
for and how does one interpret it? These questions will be addressed in
the following section.

** This statement is not universal because time has sometimes been included
explicitly in the CER form.

IV.  OUTPUT INTERPRETATION

In trying to interpret the results of Historical Simulation (or indeed
to make inferences from the usual regression statistics), the analyst is
trying to examine two basic questions about the cost estimating procedure
under study:

1.  Is the estimating procedure valid? That is, is it a true
    representation of the cost generating process under study?

2.  How reliable is the estimating procedure? That is, is the model
    variance, and hence the variance in estimates, large or small?

Insights into the answers to these questions are used by the analyst to
choose between different candidate cost estimating procedures (ranking),
to define new candidate cost estimating procedures, and to make statements
about the accuracy of his predictions.

The  value  of  the  Historical  Simulation  procedure  must  be  directly 
related  to  the  usefulness  of  its  output  as  a  means  of  providing  insights 
into  these  two  basic  questions  and  helping  the  analyst  make  the  choices 
and  statements  identified  above.  Ways  of  using  the  Historical  Simulation 
output  for  these  purposes  are  discussed  in  this  section.  The  discussion 
has  been  organized  into  the  following  three  categories: 

1.  Direct  Examination  of  the  Historical  Simulation  Output  (Sec.  IV  A) 

2.  Data  Summarizations  That  do  not  Depend  on  a  Particular 
Estimating Procedure (Sec. IV B)

3.  Statistics  Which  Depend  on  a  Particular  Estimating  Procedure 
(Sec.  IV  C) 

A.  DIRECT  EXAMINATION  OF  THE  HISTORICAL  SIMULATION  OUTPUT 

A  direct  examination  of  the  contents  of  the  output  tables  of  Sec.  Ill 
(Tables  6,  7  and  8),  can  add  insight  into  the  question  of  model  validity, 
the  identification  of  questionable  sample  points,  and  the  identification 
of new candidate estimating procedures. In the course of this
examination several useful questions can be asked; these are discussed
below making use of the form of the residual table (Table 9) which is
patterned after Table 7 of Sec. III.


TABLE 9

FORM OF RESIDUAL TABLE
(X stands for a residual value calculation)


1.  Each column of Table 9 gives the residuals for a particular
    sample point. One can ask if these residuals are improving
    (getting smaller in an absolute sense) as the sample size
    grows (that is, as the analyst looks down the column). One
    would expect the residuals to improve, or at least not get
    any worse, if the model is valid and the sample consistent.

    In Table 7 we saw that this behavior does not hold for the test
    run sample. The residuals are erratic or tend to get worse
    for sample points 9, 11, 12 and 13.

2.  Are there any consistent errors? For example, does the
    estimating procedure underestimate (have negative residuals)
    most sample points consistently? If so, then the cost
    estimating procedure shows signs of bias. Again by examining
    any column of Table 9, one might find sample points that are
    consistently under- or over-estimated by a substantial amount.
    In this case there is reason to suspect that the data point
    in question does not belong to the population, or that errors
    have been made in recording its cost or the values of the
    independent variables.

    For the test run data of Table 7 there appears to be no
    indication of bias as the residuals are neither mostly negative
    nor mostly positive. There are sample points, however, that
    show substantial consistent errors, such as points 7 and 11.

3.  Residuals along any row of Table 9 are all derived from the
    same subsample. Comparing two adjacent rows indicates the
    impact on the prediction process of the points added to the
    larger subsample. One might therefore ask if there have been
    significant changes, in some consistent manner, from one row
    to the next. If so, the sample point added is dominating the
    estimating procedure, and if the changes in residuals are not
    for the better (i.e., smaller absolute residuals) then the
    question of whether or not the sample point properly belongs
    to the population is again raised.

    As an example, if rows for subsample sizes of six and seven
    data points are compared in Table 7, we see substantial
    changes in the residuals. While some residuals have improved
    (sample points 9 and 11), others have definitely become worse
    (sample points 12 and 13). There is no question that sample
    point 7 has had a significant impact, but its impact is mixed.

4.  Finally, the estimates of the parameters (Table 8) can be
    examined. Are they reasonably stable, showing signs of
    convergence as the sample size grows? If so, then one feels
    a greater assurance of the model's validity; the information
    concerning the values of the model parameters is essentially
    the same from all the sample points. If not, then there might
    be something in the pattern of the estimated coefficients that
    would suggest a new candidate cost estimating procedure or
    that would identify a questionable sample point.

    In Table 8 it can be seen that the desired stability did not
    take place for the test run data. The inclusion of sample
    points 7 and 8 had a significant impact on the parameter
    estimates: $b_1$ rises from 0.0105 to 0.0178 when point 7 is
    added, and $b_2$ falls from 0.706 to 0.344 when point 8 is
    added. Hence, these points ought to be examined carefully.
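
The stability check on Table 8 can be made mechanical by looking at the change in each parameter from one sample size to the next. The sketch below uses the Table 8 values as printed; the function name `largest_jump` is illustrative, and taking the largest absolute change as the thing to flag is one simple choice among many.

```python
# Table 8: sample size -> (a, b1, b2)
TABLE_8 = {
    5: (-73.9, 0.0104, 0.792),
    6: (-68.4, 0.0105, 0.765),
    7: (-74.6, 0.0178, 0.706),
    8: (-31.8, 0.0198, 0.344),
    9: (-45.2, 0.0193, 0.450),
    10: (-38.6, 0.0196, 0.400),
    11: (-50.2, 0.0198, 0.478),
    12: (-44.6, 0.0191, 0.448),
    13: (-63.9, 0.0159, 0.629),
}

def largest_jump(table, index):
    """Return the sample size whose added point produced the largest
    absolute change in the parameter at position `index`."""
    sizes = sorted(table)
    changes = {
        n: abs(table[n][index] - table[prev][index])
        for prev, n in zip(sizes, sizes[1:])
    }
    return max(changes, key=changes.get)

for name, index in (("a", 0), ("b1", 1), ("b2", 2)):
    print(name, "changes most when the sample grows to", largest_jump(TABLE_8, index))
```

A point flagged this way is a candidate for closer inspection, not a finding that the point is bad.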

In summary, there is a great deal of "look-see" evidence
concerning the model validity in the output of Historical
Simulation. This output can be used to build confidence in
model validity or, conversely, aid in hypothesizing a new
cost estimating procedure. In addition, it can help to
identify questionable sample points. Furthermore, no information
concerning the process has been lost. This is in contrast
to the statistics discussed under the remaining two groupings
which depend on summarizations of the data, and most data
summarizations imply a loss of some information.
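
Two of the look-see questions above (is there a sign pattern suggesting bias, and do the residuals for a point shrink as the subsample grows) can be sketched in a few lines. The residuals are those of Tables 4 and 5, with the Table 5 entry for point 12 taken as predicted minus actual (149.0 - 156); the function names are illustrative.

```python
# Residuals by subsample size, then by predicted sample point
# (rows 5 and 6 of the residual table, from Tables 4 and 5).
RESIDUALS = {
    5: {6: -11.7, 7: -68.6, 8: 26.3, 9: -10.4,
        10: 15.5, 11: -35.5, 12: -8.1, 13: -11.6},
    6: {7: -66.9, 8: 27.7, 9: -9.2,
        10: 15.8, 11: -34.7, 12: -7.0, 13: -9.6},
}

def sign_counts(row):
    """Return (underestimates, overestimates) for one subsample size."""
    under = sum(1 for d in row.values() if d < 0)
    over = sum(1 for d in row.values() if d > 0)
    return under, over

def column_improves(residuals, point):
    """True if |residual| for `point` never grows as the subsample grows."""
    column = [abs(residuals[n][point])
              for n in sorted(residuals) if point in residuals[n]]
    return all(later <= earlier for earlier, later in zip(column, column[1:]))

print("n = 5 (under, over):", sign_counts(RESIDUALS[5]))
print("point 7 improves:", column_improves(RESIDUALS, 7))
print("point 8 improves:", column_improves(RESIDUALS, 8))
```

With only two rows this says little about the example itself; the same functions apply unchanged to a full residual table.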

B.  DATA SUMMARIZATIONS THAT DO NOT DEPEND ON A PARTICULAR ESTIMATING PROCEDURE

Data summarizations (or statistics) discussed in this section have
the property that they can be calculated for any candidate cost estimating
procedure. These summarizations can thus be used to compare different
candidate estimating procedures.

This  lack  of  dependence,  however,  introduces  uncertainty  as  to  what 
data  summarizations  should  be  used.  The  criteria  required  for  measure 
selection,  and  the  theoretical  framework  necessary  for  the  description  of 
measure  properties,  are  usually  provided  by  the  form  of  the  particular 
estimating procedure and the assumption of an underlying statistical model.
As an example, Multiple Linear Regression theory is based on a statistical
model (assumptions) applicable to linear PERs. Using this model as a
starting point, statistical arguments can be developed to pick the fit
technique (least squares), to provide convenient summary statistics
(t-tests, standard error of estimate, etc.), and to describe summary
statistic properties.

Lacking  the  capability  of  specifying  one  "best"  data  summarization, 
several  different  summarizations  are  suggested  in  this  section.  Arguments 
for  their  use  are  necessarily  heuristic  in  nature,  and  the  choice  of  which 
particular  summarization  to  use  is  left  up  to  the  analyst.  He  can  exercise 
this  choice  by  picking  loss  functions  and  weighting  schemes  best  suited 
to  his  application. 

Before describing the summarizations it will be useful to identify
the portion of the Historical Simulation output that will be used. Only
the values from the residual table, the $d_{n+k}$ of Eq. 8, are used, as it
is the errors of prediction that are of interest. Which of these residuals
to use is not entirely clear.

Using  all  of  the  residuals  is  appealing  in  that  no  information  will 
be  thrown  away.  However,  there  are  problems  involved  in  knowing  how  to 
use  all  of  them  fairly.  The  residuals  are  certainly  not  independent,  a 
fact  that  is  proven  under  the  usual  regression  assumptions  in  Appendix  II. 
Hence,  use  of  all  of  the  residuals  introduces  problems  of  statistical 
interpretation  and  weighting. 

If, however, only one residual is used for each sample point, in
particular the one made from the largest available subsample size (the
entry in the last column of Table 9, which is $d_k$ if there is no grouped
data), then the problems of weighting and statistical interpretation are
greatly reduced.* Furthermore, this selection is not without heuristic
justification. In effect, we are looking at the prediction made from
the largest available subsample size for each procurement. These are
the subsamples that would have been used and predictions that would have
been made if the cost estimating procedure had been used in the past. In
addition, an estimating procedure which predicts the near future well
need not necessarily predict the long term future well.

* In fact, it is shown in Appendix II that the one-step residuals $d_k$
are independent under the usual regression assumptions.

For notational convenience, let us relabel these residuals by
$R_{n_0+1}, \ldots, R_N$, where $n_0$ was the minimum sample size
used in the Historical Simulation, and $N$ is the size of the entire data
base. The collection of these residuals will be referred to as $R$.


The question being addressed in this section then is how to summarize
the data in $R$, so that one can choose between several estimating procedures.
In addition, it will be useful if these summarizations indicate how well
the estimating procedure will do in the future.
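As a rough illustration of how the collection $R$ might be assembled, the sketch below refits a candidate procedure on the first $i-1$ points and predicts point $i$, for $i = n_0+1, \ldots, N$. The straight-line PER, the helper names `fit_line` and `one_step_residuals`, the toy data, and the sign convention (prediction minus actual, so that a positive residual is an overestimate) are assumptions made for illustration and are not taken from the report.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def one_step_residuals(xs, ys, n0):
    """Return {i: R_i} for i = n0+1..N (1-based point numbers).

    Each R_i is computed from a fit to the first i-1 points only,
    i.e. the largest subsample available before point i.
    Assumed sign convention: prediction minus actual.
    """
    residuals = {}
    for i in range(n0 + 1, len(ys) + 1):
        a, b = fit_line(xs[:i - 1], ys[:i - 1])
        prediction = a + b * xs[i - 1]
        residuals[i] = prediction - ys[i - 1]
    return residuals


# Hypothetical time-ordered data: five points on a line, then a jump.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 15.0]
R = one_step_residuals(xs, ys, n0=3)
```

Here points 4 and 5 are predicted without error, while point 6 is underestimated, which shows up as a negative residual under the assumed convention.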

1. Some Example Data Summarizations

One  such  summarization  is  that  of  average  proportional  error.  It 
is  calculated  as  follows. 


$$\text{Average Proportional Error} = \frac{1}{N - n_0} \sum_{i=n_0+1}^{N} \frac{|R_i|}{y_i} \tag{9}$$

where $y_i$ is the actual cost of the procurement indexed by $i$,
$n_0$ is the minimum sample size used in the Historical Simulation,
and $N$ is the size of the data base.

* This does not imply that all the predictions of the most recent data points
are ignored. On the contrary, as will be seen in this section, predictions
of the most recent data points will receive at least as much emphasis as
predictions of the earlier data points. But the particular prediction
used will be from the largest data base possible for such a prediction.

The  average  proportional  error  should  be  used  when  one  is  worried 
about  proportional  cost  errors  rather  than  absolute  cost  errors.  In 
addition,  this  measure  is  probably  the  easiest  to  communicate  (and  as 
such is a good candidate for the desired measure described in Sec. II C).
Every  cost  analyst  has  been  asked  to  indicate  how  reliable  his  prediction 
is;  for  example,  is  it  within  ±10  percent?  Having  calculated  the  average 
proportional  error,  he  can  answer  this  query  by  saying,  "The  cost  estimating 
procedure  from  which  this  estimate  has  been  derived  has  an  average 
proportional  error  of,  say  15  percent,  which  implies  that  if  it  had  been 
used  to  make  these  types  of  predictions  in  the  past  it  would  have  been 
off, on the average, by 15 percent." Hence, a reasonable answer to the
query would be that an error of ±15 percent should be expected.

Contrast the above answer to one made from the usual regression
theory output utilizing statements of F-tests, t-tests, $R^2$, prediction
intervals, etc.* How aware of the underlying statistical assumptions or
the meaning of these statistics is the recipient of the prediction results?
Their meaning is certainly not as universally understandable as is average
proportional error.

* See Ref. 4 for the interpretation of these statistics in cost analysis.

There are, of course, drawbacks in using averages associated with
average proportional error, a topic which will be discussed more fully
in Sec. IV B 3, Additional Considerations. In addition to these problems,
however, average proportional error places the same emphasis on predictions
made from a sample of size 5 as predictions made from a sample of size 12.
For any cost estimating procedure that makes use of every data point in
its subsample, this equality of weighting may seem unjustified. After
all, predictions should be getting better as the sample size increases.
Hence, the following weighted average proportional error is suggested:

$$\text{Weighted Average Proportional Error} = \sum_{i=n_0+1}^{N} W_i \frac{|R_i|}{y_i} \tag{10}$$


The weights, of course, add up to one,

$$\sum_{i=n_0+1}^{N} W_i = 1,$$

and vary proportionally with the sample size. They can be as extreme
as assigning all weight to $R_N$, which is a choice that might be made by
an analyst who feels that most information is contained in the one
prediction made from the largest subsample size. My own preference for
a weighting scheme is


$$W_i = \frac{S_i}{\sum_{j=n_0+1}^{N} S_j} \tag{11}$$

where $S_i$ is the subsample size used for the particular prediction. This
equation would give the predictions from subsample size 10 twice as much
weight as the predictions from subsample size 5, and thus is in accordance
with the notion that if the estimating procedure is valid, then predictions
should improve as the sample size gets larger. Furthermore, the use of
this type of weighting scheme does not effectively change the simple
interpretation of the summary statistic discussed for Eq. 9.
Another alternative to average proportional error is that of squared
average proportional error, i.e.,

$$\text{Squared Average Proportional Error} = \frac{1}{N - n_0} \sum_{i=n_0+1}^{N} \left(\frac{R_i}{y_i}\right)^2 \tag{12}$$

One would use this type of summarization when he wishes to penalize
large proportional errors more than proportionally, i.e., quadratically.


Finally, one might be more concerned with absolute rather than
relative error. A calculation such as

$$\text{Average Squared Error} = \frac{1}{N - n_0} \sum_{i=n_0+1}^{N} R_i^2 \tag{13}$$

could be made. Although this statistic appears to be similar to the
calculation of the variance estimate* in regression theory, the residuals
in question here are based on predictions, not fits.


2. A General Framework for the Data Summarization

The data summarizations suggested so far can be placed into a general
framework through the use of loss functions and weighting schemes. Let
$\ell(R_i)$ denote the loss (or penalty) that will be assigned to the residual
value $R_i$, and let $W_i$ be the weight assigned to each residual, e.g.,
Eq. 11. Then the average loss for the weighting scheme $W$ and the loss
function $\ell$ can be defined by

$$A(\ell, W) = \sum_{i=n_0+1}^{N} W_i\, \ell(R_i) \tag{14}$$


* Standard error of the estimate squared.
If the weight $W_i$ could be interpreted as the probability of $R_i$
occurring, then the average loss calculation, defined in Eq. 14, is
equivalent to the calculation of expected loss in statistical decision
theory (Ref. 7, Chapter 5). In this latter context, the decision rule
(estimating procedure) with the smallest expected loss would be chosen.
The analogous rule in the Historical Simulation context is to prefer the
estimating procedure with the lowest average loss.*

Each of the example data summarizations previously specified is a
special case of the generalized average loss identified in Eq. 14.
Weighted Average Proportional Error, Eq. 10, is obtained by letting
$\ell(R_i) = |R_i|/y_i$, while average proportional error, Eq. 9, implicitly uses
the weighting scheme defined by

$$W_i = \frac{1}{N - n_0} \tag{15}$$

This latter weighting scheme is used for each of the other averages
previously discussed, with the loss function defined by

$$\ell(R_i) = (R_i/y_i)^2 \quad \text{for Eq. 12}$$

and by

$$\ell(R_i) = R_i^2 \quad \text{for Eq. 13}$$
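The way Eqs. 9, 12, and 13 fall out of Eq. 14 can be sketched in a few lines. The function and variable names below, and the four toy residuals and costs, are hypothetical; equal weights are used when none are supplied, which corresponds to Eq. 15.

```python
def average_loss(residuals, actuals, loss, weights=None):
    """Average loss A(l, W) of Eq. 14; defaults to the equal weights of Eq. 15."""
    n = len(residuals)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * loss(r, y) for w, r, y in zip(weights, residuals, actuals))


def proportional(r, y):
    """Loss behind Eqs. 9 and 10: |R_i| / y_i."""
    return abs(r) / y


def squared_proportional(r, y):
    """Loss behind Eq. 12: (R_i / y_i)^2."""
    return (r / y) ** 2


def squared(r, y):
    """Loss behind Eq. 13: R_i^2 (the actual cost y is not used)."""
    return r ** 2


# Hypothetical residuals and actual costs, for illustration only.
residuals = [-10.0, 20.0, 5.0, -5.0]
actuals = [100.0, 100.0, 50.0, 50.0]

ape = average_loss(residuals, actuals, proportional)
sape = average_loss(residuals, actuals, squared_proportional)
ase = average_loss(residuals, actuals, squared)
```

Passing the loss as a function keeps the ranking rule (prefer the smallest average loss) independent of which penalty the analyst selects.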

Any average loss can be used for ranking alternative estimating
procedures. The analyst need only specify the loss function and weighting
scheme best suited for his particular problem. For example, alternative
loss functions might be devised to give a greater penalty to underestimates
than overestimates. (All of the example loss functions previously
identified give equal penalty to these errors.) Such a loss function is
portrayed in Fig. 2 and defined by

$$\ell(R) = \begin{cases} \ldots & \text{if } R \ge 0 \\ \ldots & \text{if } R < 0 \end{cases}$$

Furthermore, the loss function need not be smooth. If one is very concerned
about underestimates, doesn't care about overestimates with residual
values of 0 to 15, and is only mildly concerned about greater overestimates,
then the loss function given by

$$\ell(R) = \begin{cases} \ldots & \text{if } R \ge 15 \\ \ldots & \text{if } 0 \le R < 15 \\ \ldots & \text{if } R < 0 \end{cases}$$

could be used. The positive side of this loss function is shown as the
dashed lines in Fig. 2.

* Additional considerations are identified in Sec. IV B 3.
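To make the idea of an asymmetric loss concrete, here is a minimal sketch. The report's own branch expressions are not legible in this copy, so the linear shapes and every penalty coefficient below are hypothetical; only the threshold of 15 and the qualitative behavior (heavy penalty for underestimates, none for overestimates between 0 and 15, mild penalty beyond that) follow the text. Positive residuals are taken to be overestimates.

```python
def asymmetric_loss(r, under_penalty=3.0, over_penalty=1.0):
    """Linear loss that charges underestimates (r < 0) more than overestimates.

    Both penalty rates are hypothetical illustration values.
    """
    if r < 0:
        return under_penalty * -r
    return over_penalty * r


def deadband_loss(r, threshold=15.0, under_penalty=3.0, over_penalty=0.5):
    """Non-smooth loss: no charge for overestimates in [0, threshold).

    Underestimates are charged heavily and overestimates beyond the
    threshold only mildly; the rates are hypothetical.
    """
    if r < 0:
        return under_penalty * -r
    if r < threshold:
        return 0.0
    return over_penalty * (r - threshold)


examples = {r: (asymmetric_loss(r), deadband_loss(r)) for r in (-10.0, 10.0, 25.0)}
```

Either function can be used as the loss in the average loss of Eq. 14 without any other change to the calculation.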

There  are  some  properties  of  specific  weight  and  loss  functions 
which  in  the  author's  mind  make  certain  choices  more  natural  than  others. 
These  considerations  may  help  the  analyst  to  choose  the  weighting  scheme 
and  loss  function  best  suited  for  his  application. 

Regarding  the  weighting  scheme,  if  the  candidate  estimating 
procedure  makes  use  of  the  entire  subsample,  then  the  weights  given  in 
Eq.  11  appear  most  natural.  It  implies  that  the  estimating  procedure's 
predictive  capability  is  directly  proportional  to  the  sample  size. 
A slight variation to this weighting scheme, but having similar
properties, is one that is based on degrees of freedom. Let $k$ be the
number of parameters to be estimated in the candidate PER. Define $W_i$
by

$$W_i = \frac{S_i - k}{\sum_{j=n_0+1}^{N} (S_j - k)} \tag{16}$$

The weights are all positive since the minimum sample size $n_0$ for
Historical Simulation was defined in such a manner that $n_0 > k$. Hence,
$S_i \ge n_0 > k$. This particular weighting scheme is analogous to adjusting
for degrees of freedom in the usual regression statistics. It implies
that the estimating procedure's predictive capability is directly
proportional to degrees of freedom.

A candidate estimating procedure that does not make full use of the
subsample  at  each  stage  of  the  Historical  Simulation  requires  a  different 
weighting  scheme.  For  example,  if  the  estimating  procedure  only  makes  use 
of  the  most  recent  four  data  points  in  each  subsample,  then  a  weighting 
scheme  such  as  Eq.  15  would  seem  reasonable. 
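The three weighting schemes (Eqs. 11, 16, and 15) differ only in the raw quantity that is normalized. The sketch below uses hypothetical function names; the subsample sizes 5 through 12 and $k = 3$ match the test data used later in this section.

```python
def sample_size_weights(sizes):
    """Eq. 11: weight proportional to subsample size S_i."""
    total = sum(sizes)
    return [s / total for s in sizes]


def dof_weights(sizes, k):
    """Eq. 16: weight proportional to degrees of freedom S_i - k."""
    if min(sizes) <= k:
        raise ValueError("every subsample size must exceed k")
    total = sum(s - k for s in sizes)
    return [(s - k) / total for s in sizes]


def equal_weights(sizes):
    """Eq. 15: the same weight 1 / (N - n0) for every residual."""
    return [1.0 / len(sizes)] * len(sizes)


sizes = list(range(5, 13))      # S_i for sample points 6 through 13
w_size = sample_size_weights(sizes)
w_dof = dof_weights(sizes, k=3)
w_equal = equal_weights(sizes)
```

With these sizes, the prediction from subsample size 10 carries twice the weight of the one from size 5 under Eq. 11, and 3.5 times the weight under Eq. 16, so the degrees-of-freedom scheme discounts the small-sample predictions more strongly.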

Regarding what loss function to select, if one is interested in
relative error, then the loss function used might be that used in Average
Proportional Error, Eq. 10, namely $\ell(R_i) = |R_i|/y_i$. It has the advantage
of being easy to communicate as discussed in the paragraphs following
Eq. 9.


If one is interested in absolute error, then the loss function used
in Eq. 13, namely $\ell(R_i) = R_i^2$, could be used. It has the advantage
of being the analogous calculation to the square of the standard error of
the estimate from regression theory. The latter is the quantity minimized
in least squares (if it were unadjusted for degrees of freedom) and hence
has the advantage of precedent.

A disadvantage to this loss function is that it is not as easy to
communicate as average proportional error. However, the closely related
loss function

$$\ell(R_i) = |R_i|$$

with average loss defined by

$$\text{Average Absolute Error} = \sum_{i=n_0+1}^{N} W_i |R_i| \tag{17}$$

has the same meaning for absolute error as Eq. 10 has for relative error.
It represents how much one would have been off (in an absolute sense) on
the average, if he had used this cost estimating procedure consistently
in the past.

The recommendations for loss functions and weighting schemes discussed
in this section are summarized in Table 10. The reader is reminded that
all of these summarizations are averages, and hence the decision rule of
ranking the candidate estimating procedures and taking the one with the
smallest average loss is an oversimplification of the problem, particularly
when average losses are so close that the difference may not be
significant. Some further considerations that will help in the estimating
procedure selection when the differences in average losses are small are
given in the next subsection.

TABLE 10

RESIDUAL SUMMARIZATION NOT DEPENDENT ON ESTIMATING PROCEDURE

Form:

$$\text{Average Loss} = \sum_{i=n_0+1}^{N} W_i\, \ell(R_i)$$

Suggested Weights

| $W_i$ | Remarks |
|---|---|
| $S_i / \sum_{j=n_0+1}^{N} S_j$ | Appropriate for estimating procedures which utilize entire subsample. Predictive capability of estimating procedure directly proportional to sample size. |
| $(S_i - k) / \sum_{j=n_0+1}^{N} (S_j - k)$ | Appropriate for estimating procedures which utilize entire subsample. Predictive capability of estimating procedure directly proportional to degrees of freedom. |
| $1 / (N - n_0)$ | Appropriate for estimating procedures which utilize only the last $m$ (any fixed number $\le n_0$) subsample data points. |

Suggested Loss Functions

| $\ell(R_i)$ | Remarks |
|---|---|
| $\lvert R_i \rvert / y_i$ | Appropriate for applications in which relative error is most important. Represents the average proportional error that we would have experienced if we had used the estimating procedure in the past. |
| $R_i^2$ | Appropriate for applications in which absolute error is most important. Analogous to the residual calculation in ordinary regression theory. |
| $\lvert R_i \rvert$ | Appropriate for applications in which absolute error is most important. Represents the average absolute error that we would have been off if the estimating procedure were used in the past. |

NOTATION

$N$: Total data base size
$n_0$: Minimum subsample size for Historical Simulation
$R_i$: Residual, member of $R$
$S_i$: Subsample size used for $R_i$ calculation
$k$: Number of parameters estimated in the candidate PER
$y_i$: Actual cost of procurement $i$

3. Additional Considerations

Suppose for a particular application average proportional error as
given in Eq. 9 has been selected for the average loss calculation.
Suppose also that estimating procedure A had an average proportional
error of 0.2 and estimating procedure B had one of 0.25. Should A
always be preferred to B? At least two additional questions are worth
asking.

1.  Is  there  any  apparent  bias  in  the  residuals? 

2.  What  type  of  variability  is  there  about  the  average  loss? 

The first of these considerations can be handled by a different type
of average value calculation. Comparing the simple arithmetic mean of the
residuals to zero could be used to indicate bias, if this calculation did
not imply a weighting scheme and loss function different from the one
picked by the analyst for the average loss calculation. Hence, to examine
bias for our purposes, it is suggested that the following calculation be
made:

$$B(\ell, W) = \sum_{i=n_0+1}^{N} W_i\, \ell^{\pm}(R_i) \tag{18}$$

where $W$ is the weighting scheme used in the average loss
calculation of Eq. 14, $\ell^{\pm}$ is a signed form of the loss function
used in the average loss calculation, and $B(\ell, W)$ is the apparent bias
of the estimating procedure using loss function $\ell$ and weighting
scheme $W$.


Some explanation of $\ell^{\pm}(R_i)$ will be useful. If the average
proportional error loss function is $\ell$, i.e., $\ell(R_i) = |R_i|/y_i$, then
$\ell^{\pm}(R_i) = R_i/y_i$. Hence the only difference between the two is that
$\ell^{\pm}$ retains the sign of the residual. For the squared error loss
function $\ell(R_i) = R_i^2$ let

$$\ell^{\pm}(R_i) = \begin{cases} \ell(R_i) & \text{if } R_i \ge 0 \\ -\ell(R_i) & \text{if } R_i < 0 \end{cases}$$

Note that for any loss function, it will always be true that
$|\ell^{\pm}(R_i)| = \ell(R_i)$.


The  bias,  B(£,W)  can  now  be  compared  to  zero.  The  closer  it  is 
to  zero,  the  better  the  estimating  procedure,  if  an  unbiased  estimating 
procedure  is  important. 
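A signed loss can be derived mechanically from any loss function by attaching the sign of the residual, after which the bias of Eq. 18 is an ordinary weighted sum. The following is a minimal sketch with hypothetical names and toy numbers; it is one straightforward reading of the construction described above.

```python
def signed(loss):
    """Return the signed form of a loss: +loss for r >= 0, -loss for r < 0."""
    def signed_loss(r, y):
        value = loss(r, y)
        return value if r >= 0 else -value
    return signed_loss


def squared_error(r, y):
    """Squared error loss R_i^2 (the actual cost y is not used)."""
    return r ** 2


def apparent_bias(residuals, actuals, weights, loss):
    """Apparent bias B(l, W) of Eq. 18, using the same weights as the average loss."""
    signed_loss = signed(loss)
    return sum(w * signed_loss(r, y)
               for w, r, y in zip(weights, residuals, actuals))


# Hypothetical residuals, costs, and equal weights.
residuals = [-2.0, 1.0, 3.0]
actuals = [10.0, 10.0, 10.0]
weights = [1.0 / 3.0] * 3
bias = apparent_bias(residuals, actuals, weights, squared_error)
```

Because the signed form only changes the sign, its absolute value always equals the original loss, which is the property noted in the text.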

The second consideration mentioned above is to obtain some measure
of variability around the average loss. The usual procedure would be to
make some sort of a variance calculation. For example,

$$\sum_{i=n_0+1}^{N} W_i \left[\ell(R_i) - A(\ell, W)\right]^2 \tag{19}$$

where $A(\ell, W)$ is the Average Loss, Eq. 14.

The desirable property would be for Eq. 19 to be small. For our purposes,
however, this is not very appropriate. As can be seen in Fig. 3, a small
measure of variance would imply little chance of small losses as well as
large losses. While the latter is to be avoided, the former is clearly
desirable.

Figure 3(U). Variance Calculations and Average Loss

A measure of skewness* would hence be more appropriate than a
measure of variance. Negative values of skewness, close to minus one,
would imply that most residuals had small losses, hence small errors. A
positive value would imply the opposite and would therefore detract from
an otherwise good value of average loss. The skewness calculation for
this application is given by

$$S_k(\ell, W) = \frac{\sum_{i=n_0+1}^{N} W_i \left[\ell(R_i) - A(\ell, W)\right]^3}{\left\{\sum_{i=n_0+1}^{N} W_i \left[\ell(R_i) - A(\ell, W)\right]^2\right\}^{3/2}} \tag{20}$$

where $A(\ell, W)$ is the Average Loss, Eq. 14,
$W_i$ are the weights used in Average Loss, and
$\ell(R_i)$ is the loss function used in Average Loss.

* As defined by Cramer, Ref. 8, page 184, as $\mu_3/\sigma^3$, where $\mu_3$ is the
third central moment and $\sigma$ is the standard deviation.
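For readers who want to experiment with a skewness measure, the sketch below computes a weighted version of $\mu_3/\sigma^3$, using the weighted variance of Eq. 19 for $\sigma^2$. This particular weighted form is an assumption based on the Cramer definition cited in the footnote; the function name and the toy losses are hypothetical.

```python
def weighted_skewness(losses, weights):
    """Weighted mu_3 / sigma^3 of the losses; weights are assumed to sum to one."""
    mean = sum(w * l for w, l in zip(weights, losses))                # Eq. 14
    m2 = sum(w * (l - mean) ** 2 for w, l in zip(weights, losses))   # Eq. 19
    m3 = sum(w * (l - mean) ** 3 for w, l in zip(weights, losses))   # third central moment
    return m3 / m2 ** 1.5


equal = [0.25] * 4

# Three identical small losses and one much larger loss.
one_large_loss = weighted_skewness([0.1, 0.1, 0.1, 0.5], equal)

# Three identical large losses and one much smaller loss.
one_small_loss = weighted_skewness([0.5, 0.5, 0.5, 0.1], equal)
```

Under this definition a single loss far above the average produces a positive value and a single loss far below the average produces a negative one, and the magnitude is not confined to the interval from minus one to one.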

The example given at the start of this section hypothesized two
estimating procedures. Procedure A had $A(\ell, W) = 0.2$, while Procedure B
had $A(\ell, W) = 0.25$. Answering the question of which one is preferable
can be aided by calculating the measures just defined. Suppose that for
Procedure A, $B(\ell, W) = 0.05$ and $S_k(\ell, W) = 0$. Then, if the equivalent
measures for Procedure B were $B(\ell, W) = 0.1$ and $S_k(\ell, W) = 0.5$, the case
for selecting A over B would be strengthened. If, however, the
measures for B were $B(\ell, W) = 0.01$ and $S_k(\ell, W) = -0.5$, the case for
choosing A would be weaker. Procedure B is less biased and shows a
large negative skewness which implies that losses smaller than the average
were far more plentiful (or had more weight) in the sample (and hence we
would hope more likely in the future) than were losses larger than the
average. Procedure A, on the other hand, had zero skewness implying that
large and small losses are equally likely.


4. Example Calculations

To familiarize the reader with the summarizations suggested in
Sec. IV B 2 and the additional statistics defined in the last subsection
(IV B 3), example calculations are made and presented for the computer
program test data.

From the example used in the computer test run summarized in
Tables 3 and 7 we have the following data (in the notation of Table 10).


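TABLE 11

EXAMPLE DATA
$N = 13$; $n_0 = 5$; $k = 3$

| Sample Point | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|
| Residual $R_i$ | -11.7 | -66.9 | 31.4 | -9.9 | 4.7 | -18.5 | 23.7 | 50.1 |
| Subsample Size $S_i$ | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Actual Cost $y_i$ | 67 | 243 | 54 | 112 | 106 | 183 | 156 | 177 |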

If the proportional error loss function is selected, then the
average loss will be called average proportional error and is given by

$$\text{Average Proportional Error} = \sum_{i=n_0+1}^{N} W_i \frac{|R_i|}{y_i} \qquad \text{(after Eq. 10)}$$

The proportional error for each residual is given in Table 12.


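TABLE 12

PROPORTIONAL ERROR

| Sample Point | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|
| $R_i$ | -11.7 | -66.9 | 31.4 | -9.9 | 4.7 | -18.5 | 23.7 | 50.1 |
| $y_i$ | 67 | 243 | 54 | 112 | 106 | 183 | 156 | 177 |
| Proportional Error $\lvert R_i \rvert / y_i$ | 0.174 | 0.275 | 0.582 | 0.088 | 0.044 | 0.101 | 0.152 | 0.283 |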
Three weighting schemes have been suggested in Table 10. These can
be used to modify Eq. 10 as follows:

$$\text{Average Proportional Error (weight proportional to sample size)} = \frac{\sum_{i=6}^{13} S_i \dfrac{|R_i|}{y_i}}{\sum_{i=6}^{13} S_i} \tag{21}$$

$$\text{Average Proportional Error (weight proportional to degrees of freedom)} = \frac{\sum_{i=6}^{13} (S_i - k) \dfrac{|R_i|}{y_i}}{\sum_{i=6}^{13} (S_i - k)} \tag{22}$$

$$\text{Average Proportional Error (equal weight)} = \frac{1}{13 - 5} \sum_{i=6}^{13} \frac{|R_i|}{y_i} \tag{23}$$

All that remains is to substitute the values of $S_i$ and $k$ from
Table 11 and $|R_i|/y_i$ from Table 12 and carry out the arithmetic. The
results are given below. In this particular example, the weights
do not greatly affect the average. All average proportional errors are
around 20 percent.


| Weight | Average Proportional Error |
|---|---|
| Proportional to Sample Size | 0.202 |
| Proportional to Degrees of Freedom | 0.197 |
| Equal Weight | 0.212 |
The tendency in this example for larger proportional errors with
predictions from smaller sample sizes can be seen by the fact that the equal
weight measure gives the highest average proportional error while the
degrees of freedom weighting scheme yields the lowest. These weighting
schemes give the most and least weight to residuals calculated from small
sample size predictions, respectively.
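The arithmetic for the three averages can be checked with a short script. The data are those of Table 11; the helper name `weighted_average` is hypothetical. Working from the unrounded ratios, the results agree with the tabulated values to within about 0.001, the small differences presumably being due to rounding in the original computation.

```python
# Data from Table 11 (sample points 6 through 13).
residuals = [-11.7, -66.9, 31.4, -9.9, 4.7, -18.5, 23.7, 50.1]   # R_i
sizes = [5, 6, 7, 8, 9, 10, 11, 12]                              # S_i
actuals = [67, 243, 54, 112, 106, 183, 156, 177]                 # y_i
k = 3                                                            # parameters in the PER

prop_errors = [abs(r) / y for r, y in zip(residuals, actuals)]


def weighted_average(values, raw_weights):
    """Weighted mean; raw weights are normalized so that they sum to one."""
    total = sum(raw_weights)
    return sum(w * v for w, v in zip(raw_weights, values)) / total


ape_size = weighted_average(prop_errors, sizes)                   # sample-size weights
ape_dof = weighted_average(prop_errors, [s - k for s in sizes])   # degrees-of-freedom weights
ape_equal = weighted_average(prop_errors, [1] * len(sizes))       # equal weights

for label, value in [("sample size", ape_size),
                     ("degrees of freedom", ape_dof),
                     ("equal weight", ape_equal)]:
    print(f"{label:20s} {value:.3f}")
```

The ordering (degrees of freedom lowest, equal weight highest) matches the observation in the text that the larger proportional errors came from the smaller subsamples.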


Calculations for bias and skewness are made for the weighting
scheme that is proportional to sample size only, i.e.,

$$W_i = \frac{S_i}{\sum_{j=6}^{13} S_j}$$

This should suffice to indicate how these measures are calculated.


The calculation for bias, Eq. 18, is very similar to those for
average proportional error. All that should be done to Eq. 21, to obtain
the signed loss (proportional error), is to remove the absolute value
sign from $R_i$. Alternatively, one can use the proportional error from
Table 12 and assign the sign of $R_i$ from the same table. Thus, for data
point 6, we have signed proportional error equals -0.174. The values to
be averaged are given in Table 13 and the modified Eq. 18 for bias is
given as Eq. 24. The value obtained for bias is 0.078. Note, however,
that the numbers of over- and underestimates are the same. The large
error in estimating procurement 8 dominates the bias calculation.

$$\text{Bias}(\ell, W) = \frac{\sum_{i=6}^{13} S_i \dfrac{R_i}{y_i}}{\sum_{i=6}^{13} S_i} \tag{24}$$
where $\ell$ is proportional error and
$W$ is the weighting scheme proportional to sample size.

TABLE 13

BIAS CALCULATION VALUES

| Sample Point | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|
| $S_i$ | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Signed Proportional Error $R_i / y_i$ | -0.174 | -0.275 | 0.582 | -0.088 | 0.044 | -0.101 | 0.152 | 0.283 |
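The bias figure can be reproduced directly from the Table 11 data with sample-size weights. The variable names are hypothetical, and the script also counts the overestimates and isolates the share of the bias contributed by sample point 8.

```python
# Data from Table 11 (sample points 6 through 13).
points = [6, 7, 8, 9, 10, 11, 12, 13]
residuals = [-11.7, -66.9, 31.4, -9.9, 4.7, -18.5, 23.7, 50.1]   # R_i
sizes = [5, 6, 7, 8, 9, 10, 11, 12]                              # S_i
actuals = [67, 243, 54, 112, 106, 183, 156, 177]                 # y_i

# Signed proportional errors R_i / y_i (Table 13).
signed_errors = [r / y for r, y in zip(residuals, actuals)]

# Weighted contributions S_i * (R_i / y_i) / sum(S), one per sample point.
total_size = sum(sizes)
contributions = {p: s * e / total_size
                 for p, s, e in zip(points, sizes, signed_errors)}

bias = sum(contributions.values())                 # Eq. 24
n_over = sum(1 for r in residuals if r > 0)        # positive residuals
n_under = sum(1 for r in residuals if r < 0)       # negative residuals
largest_point = max(contributions, key=lambda p: abs(contributions[p]))
```

The result is close to the reported 0.078, the positive and negative residuals are equal in number, and the single largest contribution comes from sample point 8, as stated in the text.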

By far the hardest measure to calculate is skewness. The modified
version of the skewness equation, Eq. 20, is given below:

$$S_k(\ell, W) = \frac{\sum_{i=6}^{13} S_i \left(\dfrac{|R_i|}{y_i} - 0.202\right)^3 \Big/ \sum_{i=6}^{13} S_i}{\left[\sum_{i=6}^{13} S_i \left(\dfrac{|R_i|}{y_i} - 0.202\right)^2 \Big/ \sum_{i=6}^{13} S_i\right]^{3/2}}$$

where $|R_i|/y_i$ is the proportional error,
$S_i$ is the sample size,
0.202 is the average proportional error for the example,
$\ell$ is proportional error, and
$W$ is the weighting scheme proportional to sample size.

The necessary data for the calculation can be obtained from Tables 11
and 12. A measure of skewness equal to 0.166 is obtained. Hence, the
distribution shows some positive skewness, the large overestimate of
sample point 8 outweighing the fact that 5 of the 8 proportional errors
are less than the average.

It is hoped that the example calculations carried out in this section
serve  as  a  guide  to  help  the  reader  make  calculations  of  average  loss, 
bias,  and  skewness  for  the  loss  function  and  weighting  scheme  best  suited 
for  his  problem.  It  will  be  useful  now  to  consider  the  possible  directions 
of  future  work  that  might  improve  these  measures. 

5. Future Work

As pointed out at the beginning of this section, the arguments
presented for the various residual summarizations and other measures have
been heuristic in nature. This was due to the lack of an assumed
underlying statistical model. The arguments are hence analogous to those
that are used for various curve fitting schemes when a statistical model
has not been assumed.

Several  possible  courses  of  action  might  be  taken  to  either  make 
the  arguments  for  these  statistics  more  rigorous  or  to  derive  better 
measures.  Formal  methods  of  nonparametric  statistics  might  be  useful  in 
making  more  rigorous  the  comparison  between  estimating  procedures  A  and  B 
at  the  end  of  Sec.  IV  B  3.  Another  possibility  is  to  explore  the  use  of 
average  loss  for  a  ranking  technique  for  several  classes  of  candidate 
estimating  procedures  and  their  implied  statistical  models.  This  could 
be  accomplished  with  the  aid  of  Monte  Carlo  techniques.  The  probability 
of  selecting  the  wrong  estimating  procedure,  i.e.,  making  an  incorrect 
ranking,  could  be  estimated. 

The effort required to investigate these possibilities is certainly not
trivial. In the meantime, the statistics suggested appear to be reasonable
and should help the analyst to make choices between any candidate
estimating procedures. In addition, several of the statistics identified, i.e.,
Average Proportional Error, Eq. 10, and Average Absolute Error, Eq. 17,
have interpretations that are easy to communicate and can be used to give
one a feeling of the estimating procedure's validity. They summarize
the error which would have been present (on the average) if the candidate
estimating procedure had been used to predict the cost in the past. Thus,
these measures are good candidates for the desired measures identified
in Sec. II C 1.


C.  STATISTICS  WHICH  DEPEND  ON  A  PARTICULAR  ESTIMATING  PROCEDURE 

A final set of statistics can be calculated from the Historical
Simulation output by making use of any statistical model assumptions that
are usually associated with the particular cost estimating procedure
under examination. An example is the multiple linear regression model,
which is usually assumed when the cost estimating procedure of interest
comprises a linear PER and a least squares technique for estimating the
parameters.* Another example is the use of a multiplicative error term
$\epsilon$ with a log-normal distribution** when the PER is given by

$$Y = a X_1^{b_1} X_2^{b_2} \cdots X_p^{b_p}$$

where $Y$ stands for cost, $X_1, X_2, \ldots, X_p$ are the independent
variables, and the technique is a least squares curve fit performed on

$$\log Y = \log a + b_1 \log X_1 + \cdots + b_p \log X_p$$

The statistical model in this case is

$$Y_i = a X_{1i}^{b_1} X_{2i}^{b_2} \cdots X_{pi}^{b_p}\, \epsilon_i$$

where $\log \epsilon_i$ is distributed normally with zero mean, variance equal to
$\sigma^2$, and zero covariances.

* In fact, the choice of the least squares technique can be viewed as a
consequence of the multiple linear regression model assumption (for a
linear PER) as the estimators obtained have some optimal properties.
These properties are stated in the Gauss-Markov theorem. According to
Ref. 7, page 387, this theorem states that "the least squares estimate
in the class of unbiased, linear estimates, has a minimum variance
property: the variances of its components are (simultaneously) smallest."
In addition, they are maximum likelihood estimates.

** For additional information concerning this distribution see Ref. 7,
page 89.
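The log-transform fit described above can be sketched for the single-variable case $Y = aX^b$. The function name and the noise-free toy data are hypothetical; with several independent variables the same idea applies, but a general least squares solver would be needed in place of the closed-form slope.

```python
import math


def fit_power_law(xs, ys):
    """Fit Y = a * X**b by ordinary least squares on log Y = log a + b log X."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                       # slope in log space is the exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log space is log a
    return a, b


# Hypothetical data generated without error from Y = 3 * X**0.5.
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** 0.5 for x in xs]
a, b = fit_power_law(xs, ys)
```

Because the fit minimizes squared errors in $\log Y$, it corresponds to the multiplicative log-normal error model rather than to an additive error in $Y$.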


From  an  operational  point  of  view,  statistics  based  on  an  assumed 
distribution  are  less  versatile  than  the  general  data  summarizations 
discussed in Sec. IV B. They are valid only if the assumed statistical
model  is  valid. 


However,  these  statistics  are  still  worth  examining.  Since  they 
are  valid  for  any  estimating  procedures  which  utilize  the  same  statistical 
assumptions,  for  example,  the  class  of  linear  PERs  (with  least  squares  curve 
fit),  they  can  be  used  to  compare  candidate  estimating  procedures  in  the 
class.  However,  these  comparisons  can  also  be  made  with  the  usual 
evaluation  procedures,  i.e.,  the  usual  regression  statistics,  and,  hence, 
benefits  gained  using  Historical  Simulation  do  not  include  a  comparison 
that  cannot  otherwise  be  directly  made  (Sec.  II  C,  Property  2). 

Another  use  for  these  statistics  is  to  ascertain  whether  the 
statistical  model  and/or  estimating  procedure  is  valid.  Does  the  Historical 
Simulation  output  fit  in  with  the  output  that  should  be  theoretically 
expected,  assuming  that  the  statistical  model  and  estimating  procedure 
assumptions  are  valid?  If  the  output  does  not  fit,  then  some  of  these 
assumptions must have been violated and hence the model should not be
accepted.

Finally, statistics that are usually calculated (on the entire sample)
for the estimating procedure can be derived for each of the Historical
Simulation subsamples. For example, $R^2$ and the standard error of the
estimate can be calculated for each subsample, if the estimating procedure
assumes a linear PER and a least squares curve fitting technique. These
statistics can be used, in the traditional manner, to evaluate how well
the estimating procedure is performing on the particular subsample. Would
the model have been acceptable for the particular subsample? Would it have
been rejected for a larger subsample?

Work  accomplished  to  date  on  the  development  of  these  statistics 
has  been  confined  to  linear  PERs,  least  squares  estimating  techniques, 
and the usual multiple linear regression model. This class of estimating
procedures has been labeled Linear PER-Least Squares Procedures. The
development  of  these  statistics  for  this  class  of  estimating  procedures 
will  have  the  added  benefit  that  their  study  will  more  clearly  define 
the  relationship  between  the  Historical  Simulation  output  and  the  usual 
multiple  linear  regression  evaluations. 

To date the theoretical distribution of the Historical Simulation
output (the predictions and residuals) has been determined. A
goodness-of-fit test and a test to determine if there is bias present have been
defined for a subset of the Historical Simulation output. Finally, several
statistics  have  been  identified  that  are  useful  in  describing  subsample 
fits.  Each  of  these  topics  will  be  discussed  in  subsequent  paragraphs, 
but  first  it  will  be  useful  (for  clarity's  sake)  to  define  the  Historical 
Simulation  procedure  (for  multiple  linear  regression  models)  in  matrix 
notation. 

We are given a sample which consists of N (p+1)-tuples
$(y_i, x_{i1}, x_{i2}, \ldots, x_{ip})$ for $i = 1, 2, \ldots, N$. These (p+1)-tuples
have been ordered in time.

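The usual multiple linear regression hypothesis is given by

$$Y = X\beta + \epsilon$$

where

$Y = (y_1, y_2, \ldots, y_N)'$ is an N x 1 column vector of observed y values,

$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}$$

is an N x (p+1) matrix of independent variable values (and a 1 for the
constant multiplier),

$\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ is the (p+1) x 1 column vector of model coefficients, and

$\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_N)'$ is an N x 1 column vector of error terms.

The matrix X and the vector $\beta$ are assumed to be nonrandom; $\epsilon$,
on the other hand, is a normally distributed random vector with zero means,
a constant variance $\sigma^2$, and zero covariances. That is

$$E(\epsilon_i) = 0, \qquad \text{Variance}(\epsilon_i) = \sigma^2, \qquad i = 1, 2, \ldots, N$$

$$\text{Covariance}(\epsilon_i, \epsilon_j) = 0, \qquad i \neq j$$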


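Let $n_0$ be the minimum sample size that is greater than or equal
to the smallest sample size necessary to carry out a linear regression
analysis. Hence, $n_0 \geq p + 2$. For any n, $n_0 \leq n < N$, define the
following partition of the X matrix by

$$X = \begin{pmatrix} X^{(n)} \\ X_{(n)} \end{pmatrix} \qquad \begin{matrix} n \text{ rows} \\ N-n \text{ rows} \end{matrix}$$

Also partition Y in a similar manner obtaining

$$Y = \begin{pmatrix} Y^{(n)} \\ Y_{(n)} \end{pmatrix} \qquad \begin{matrix} n \text{ entries} \\ N-n \text{ entries} \end{matrix}$$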

If time batches are ignored, the Historical Simulation Procedure
can be defined as follows:

For each n, $n_0 \leq n < N$

1. Make a least squares fit using $Y^{(n)}$ and $X^{(n)}$ as the data
base.

2. Obtain an estimating vector of $\beta$. Denote this vector $\hat{\beta}^{(n)}$.

3. Use the resulting fit to make predictions of the remaining
N-n data points. This can be denoted by

$$\hat{Y}_{(n)} = X_{(n)} \hat{\beta}^{(n)}$$

where $\hat{y}^{(n)}_{n+k}$ is the prediction of $y_{n+k}$ arrived at using a sample of
size n.

4. Calculate the residuals by

$$d^{(n)}_{n+k} = \hat{y}^{(n)}_{n+k} - y_{n+k}$$

where $d^{(n)}_{n+k}$ denotes the difference (residual) between the predicted
$\hat{y}^{(n)}_{n+k}$ and the observed $y_{n+k}$.
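The four steps above can be sketched in a few lines of code. This is a minimal illustration, not the report's computer program: the function name `historical_simulation`, the use of NumPy's `lstsq`, and the demonstration data are all assumptions made for the example. Residuals are taken as predicted minus observed, which matches the later discussion in which low predictions give negative residuals.

```python
import numpy as np

def historical_simulation(X, y, n0):
    """Return {(n, t): d} where d is the residual d^(n)_t (1-based indices).

    X is an N x (p+1) matrix whose first column is all ones and whose rows
    are ordered in time; y holds the N observed values.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)
    residuals = {}
    for n in range(n0, N):
        # Steps 1-2: least squares fit on the first n points.
        beta_hat, *_ = np.linalg.lstsq(X[:n], y[:n], rcond=None)
        # Step 3: predict the remaining N-n points.
        y_pred = X[n:] @ beta_hat
        # Step 4: residual = predicted minus observed.
        for k, (pred, obs) in enumerate(zip(y_pred, y[n:]), start=1):
            residuals[(n, n + k)] = pred - obs
    return residuals

# Hypothetical data: an exactly linear series, so every residual is ~0.
x = np.arange(1.0, 8.0)                      # N = 7 points, p = 1
X_demo = np.column_stack([np.ones_like(x), x])
y_demo = 2.0 + 3.0 * x
res = historical_simulation(X_demo, y_demo, n0=3)   # n0 = p + 2
```

With N = 7 and n0 = 3 the loop produces 4 + 3 + 2 + 1 = 10 residuals, one for every (subsample size, predicted point) pair.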


1. Distribution of the Historical Simulation Predictions and Residuals

The form and distribution of this output have been summarized in
Table 14, with derivations given in Appendix II. Several interesting
results which can be observed from this table are discussed below.

TABLE 14

FORM AND DISTRIBUTION OF PREDICTIONS AND RESIDUALS
(Assuming the usual multiple linear regression model assumptions)

                              Prediction                          Residual

Notation                      $\hat{y}^{(n)}_{n+k}$               $d^{(n)}_{n+k}$

Sample point to which the     n+k ; $0 < k \leq N-n$              n+k ; $0 < k \leq N-n$
prediction (residual)
pertains

Subsample size used           n                                   n

Calculation                   $\sum_{j=1}^{n} A^{n+k}_{j} y_j$    $\hat{y}^{(n)}_{n+k} - y_{n+k}$

Distribution                  Normal                              Normal

Expected value                $x'_{n+k}\beta$                     0

Variance                      $\sigma^2 C^{n+k}_{n+k}$            $\sigma^2 (1 + C^{n+k}_{n+k})$

Covariances (with)            $\hat{y}^{(m)}_{m+j}$               $d^{(m)}_{m+j}$

  $m \leq n$ and
    $m+j \leq n$              $\sigma^2 C^{n+k}_{m+j}$            0
    $m+j > n$ and
    $m+j \neq n+k$            $\sigma^2 C^{n+k}_{m+j}$            $\sigma^2 C^{n+k}_{m+j}$
    $m+j = n+k$               $\sigma^2 C^{n+k}_{m+j}$            $\sigma^2 (1 + C^{n+k}_{m+j})$

  $m > n$                     same as above with $C^{m+j}_{n+k}$ replacing $C^{n+k}_{m+j}$

where

$x'_{*}$ is a row vector equal to the *th row of the matrix X

$x_{*}$ is a column vector equal to the transpose of the *th row of the matrix X

$y_j$ is the jth component of the vector Y

$A^{n+k}_{j} = x'_{n+k} (S^{(n)})^{-1} x_j$

$C^{n+k}_{m+j} = x'_{n+k} (S^{(n)})^{-1} x_{m+j}$

$S^{(n)} = X'^{(n)} X^{(n)}$ where $X^{(n)}$ is the first n rows of X
and $X'^{(n)}$ is its transpose

$\beta$ are the unknown parameters

and $\sigma^2$ is the variance of the error terms
The distribution of the predictions and residuals is normal. The
predictions are unbiased in the sense that $E(\hat{y}^{(n)}_{n+k}) = x'_{n+k}\beta = E(y_{n+k})$.
Hence, the expected value of the residuals is zero. The variance of the
residuals is related to the variance of the predictions in the sense that

$$VAR(d^{(n)}_{n+k}) = \sigma^2 + VAR(\hat{y}^{(n)}_{n+k})$$

This is a consequence of the fact that $y_{n+k}$ is not used in the calculation
of $\hat{y}^{(n)}_{n+k}$ and hence is independent of $\hat{y}^{(n)}_{n+k}$.

The residual covariances are related to the prediction covariances;
in fact they are equal unless one of the points being predicted is not
predicted from both subsamples, or the point being predicted is the same
for both subsamples. In the first of these exceptions, i.e., when
comparing $d^{(n)}_{n+k}$ and $d^{(m)}_{m+j}$ with $m+j \leq n$, the residual covariance is
zero. In the second exception, i.e., when comparing $d^{(n)}_{n+k}$ and $d^{(m)}_{m+j}$
with $m+j = n+k$, the calculation is similar to a variance calculation. The
residual covariance is obtained by adding $\sigma^2$ to the prediction
covariance. To summarize, then, we have for $m \leq n$

$$COV(d^{(n)}_{n+k}, d^{(m)}_{m+j}) = \begin{cases} 0 & ;\ m+j \leq n \\ \sigma^2 + COV(\hat{y}^{(n)}_{n+k}, \hat{y}^{(m)}_{m+j}) & ;\ m+j = n+k \\ COV(\hat{y}^{(n)}_{n+k}, \hat{y}^{(m)}_{m+j}) & ;\ m+j > n \text{ and } m+j \neq n+k \end{cases}$$


Another observation that can be made about the covariances is that
they depend on the two subsample sizes m and n only through which S
matrix to use. If $m \leq n$, then $S^{(n)}$ is used. (This is the only
difference between the coefficients $C^{n+k}_{m+j}$ and $C^{m+j}_{n+k}$ in Table 14.) The
rule to follow is: use the S matrix corresponding to the larger
subsample.

If it is noted that $S^{(n)}$ is a function of $X^{(n)}$ (the data base
used for the subsample size n fit) then the above result is not surprising.
If $m \leq n$ then $X^{(n)}$ contains all the information that $X^{(m)}$
contains plus the extra rows. Hence $S^{(n)}$ will contain all the information
available in $S^{(m)}$ plus more. Hence in deriving the covariance of two
predictions or residuals, the information available in the larger subsample
fit is required and includes the information available in the smaller
subsample.


A final observation is that although the predictions and residuals
are generally correlated (among themselves), there are some residuals
which have zero covariance. In particular, if $m+j \leq n$, then
$COV(d^{(n)}_{n+k}, d^{(m)}_{m+j}) = 0$. In words this implies that the residual calculation
for a particular sample point has zero covariance with any residual
calculation based on a subsample which includes the specified sample point. The
importance of this result lies in the fact that zero covariance implies
independence when the random variables are normally distributed. Hence
$d^{(n)}_{n+k}$ and $d^{(m)}_{m+j}$ are independent if $m+j \leq n$. In particular then, the
one-step residuals, i.e.,

$$d^{(n_0)}_{n_0+1},\ d^{(n_0+1)}_{n_0+2},\ \ldots,\ d^{(n)}_{n+1},\ \ldots,\ d^{(N-1)}_{N}$$

are mutually independent. This fact will enable several statistical tests
to be applied to the one-step residuals.
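The zero-covariance property can be checked numerically without any simulation. Each residual is a linear combination of the observed y values, so with uncorrelated errors of variance $\sigma^2$ the covariance of two residuals is $\sigma^2$ times the dot product of their weight vectors. The sketch below assumes that representation; the helper names `residual_weights` and `residual_cov` and the design matrix `X_demo` are hypothetical and serve only as an illustration.

```python
import numpy as np

def residual_weights(X, n, t):
    """Weights a (length N) such that d^(n)_t = a @ y, for a 1-based point t > n."""
    N = X.shape[0]
    Xn = X[:n]
    S_inv = np.linalg.inv(Xn.T @ Xn)
    a = np.zeros(N)
    a[:n] = X[t - 1] @ S_inv @ Xn.T   # prediction part: x_t' (S^(n))^-1 X^(n)'
    a[t - 1] -= 1.0                    # minus the observed y_t
    return a

def residual_cov(X, n, t, m, s, sigma2=1.0):
    """COV(d^(n)_t, d^(m)_s) under uncorrelated errors with variance sigma2."""
    return sigma2 * residual_weights(X, n, t) @ residual_weights(X, m, s)

# Hypothetical design matrix: constant term plus one independent variable.
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
X_demo = np.column_stack([np.ones_like(x), x])

# One-step residuals d^(4)_5 and d^(3)_4: point 4 lies inside subsample 4.
cov_one_step = residual_cov(X_demo, 4, 5, 3, 4)

# Same predicted point (5) from subsamples 4 and 3: sigma^2 (1 + C), C from S^(4).
cov_same_point = residual_cov(X_demo, 4, 5, 3, 5)
C_55 = X_demo[4] @ np.linalg.inv(X_demo[:4].T @ X_demo[:4]) @ X_demo[4]
```

Here `cov_one_step` comes out as zero up to rounding, while `cov_same_point` equals `1 + C_55`, in line with the rule that the S matrix of the larger subsample is the one that enters.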


Before discussing these tests it is notationally convenient to
redefine the one-step residuals as follows:

Let

$$r_{n+1} = \frac{d^{(n)}_{n+1}}{(1 + C^{n+1}_{n+1})^{1/2}}$$

where

$$C^{n+1}_{n+1} = x'_{n+1} (S^{(n)})^{-1} x_{n+1}$$

as defined in Table 14.

Since the variance of $d^{(n)}_{n+1}$ is $\sigma^2 (1 + C^{n+1}_{n+1})$, this transformation has the
effect of giving the residuals $r = (r_{n_0+1}, r_{n_0+2}, \ldots, r_N)$ a common
variance $\sigma^2$. Hence, we have that the residuals r are independent and
normally distributed, with zero mean and common variance $\sigma^2$. Alternatively,
r constitutes a random sample from a normal population with zero mean
and variance equal to $\sigma^2$.

* Zero covariance does not usually imply independence. The fact that the
added condition of normality implies independence is discussed in Appendix II.
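A short sketch of the adjustment may help. It assumes the residual is predicted minus observed and that the scaling factor is the square root of $1 + C^{n+1}_{n+1}$, as implied by the variance expression; the function name and the demonstration data are hypothetical.

```python
import numpy as np

def one_step_adjusted_residuals(X, y, n0):
    """Return the adjusted one-step residuals r_(n0+1), ..., r_N as an array."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)
    r = []
    for n in range(n0, N):
        Xn, yn = X[:n], y[:n]
        S_inv = np.linalg.inv(Xn.T @ Xn)          # (S^(n))^-1
        beta_hat = S_inv @ Xn.T @ yn              # least squares coefficients
        x_next = X[n]                             # row of point n+1 (0-based index n)
        d = x_next @ beta_hat - y[n]              # one-step residual d^(n)_(n+1)
        C = x_next @ S_inv @ x_next               # C^(n+1)_(n+1)
        r.append(d / np.sqrt(1.0 + C))            # common variance sigma^2
    return np.array(r)

# Hypothetical data: exactly linear except the last point, which is raised by 1.
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
X_demo = np.column_stack([np.ones_like(x), x])
y_demo = 2.0 + 3.0 * x
y_demo[-1] += 1.0
r_demo = one_step_adjusted_residuals(X_demo, y_demo, n0=3)
```

In this example every adjusted residual except the last is zero up to rounding, and the last equals -1 divided by the square root of one plus its C coefficient, since the final prediction falls one unit below the raised observation.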


2. Tests on the One-Step Residuals

Two types of tests have been constructed for the one-step residuals.
The first is a goodness-of-fit test which asks the question: Do the
one-step residuals appear to have been derived from a multiple linear
regression model with the assumed linear PER? The second addresses the
question of bias and asks the question: Do the one-step residuals appear
to have zero mean (as they theoretically should)?


The question of whether the model assumptions are satisfied has
not been one of the central questions for theoretical statisticians. To
be sure, a great body of knowledge has been built up around the closely
related subject of hypothesis testing, but these tests are concerned with
choosing between two states of nature, the null hypothesis and a specified
alternative hypothesis. The question we are asking can be placed in the
hypothesis testing context. The null hypothesis is that r is a
random sample from a normal population with zero (0) mean, and variance
$\sigma^2$. Notationally this is given by:

$$H_0: r = N(0, \sigma^2 I)$$

where I is the identity matrix of order $N - n_0$.

It should be pointed out that zero covariances were the requirements for
these tests. Hence, the tests would appear to be equally applicable to
all residuals if these residuals were orthogonalized. Pursuit of this
topic is beyond the scope of the present work, however.

The alternative hypothesis can be a class of alternative hypotheses such
as, "The random vector r is from a normal distribution."

The alternative hypothesis is not specific, however. It is that r
is not $N(0, \sigma^2 I)$. Hence the usual techniques for hypothesis testing,
e.g., maximum likelihood, are not applicable. Fortunately, a few tests,
called goodness-of-fit tests, have been devised to handle this question,
but they are unfortunately not very powerful against specific alternatives.*
Hence, if the question is to choose between two specified alternatives, a
test built around these alternatives should be developed.

The two main types of goodness-of-fit tests** are the Chi-Square
test and tests that compare distribution functions. The Chi-Square test
requires a partition of the sample and a comparison of the frequency of
observations to the theoretical frequency. This test in general requires
a large sample size and is therefore not very usable for the cost
application.***

* There are two types of errors that can be made in a hypothesis testing
problem. A Type I error is made when $H_0$ is rejected and it was true.
A Type II error is made when $H_1$ is rejected and it was true. Denote
the probability for these two types of error by $P_{H_0}(\text{Reject } H_0)$ and
$P_{H_1}(\text{Reject } H_1)$, respectively. The statement that goodness-of-fit tests
are not very powerful against specific alternatives implies that in
general there exists a hypothesis test for the specific alternative such
that for a given $P_{H_0}(\text{Reject } H_0)$, $P_{H_1}(\text{Reject } H_1)$ using this other test
is less than $P_{H_1}(\text{Reject } H_1)$ using the goodness-of-fit test. For a
further discussion of this concept see Ref. 7, Chapter 7.

** See Ref. 7, Section 9.1 for a complete discussion.

*** See Ref. 9, page 46.

Of the tests that compare distribution functions, the Kolmogorov-Smirnov
(K-S) test is perhaps the most widely known. Its advantage over
the Chi-Square test is that it appears to be more powerful and it is
applicable for small sample sizes. The test is also relatively simple.
Tailored for our present application, it is outlined below.


Order the residual vector r from smallest to largest to obtain
$r^{(1)}, r^{(2)}, \ldots, r^{(N-n_0)}$, where $r^{(j)}$ is the jth order statistic.
Calculate the sample distribution function $F_{N-n_0}(r)$ by letting

$$F_{N-n_0}(r) = \frac{j}{N-n_0} \quad \text{for } r^{(j)} \leq r < r^{(j+1)};\ j = 0, 1, \ldots, N-n_0 \qquad (26)$$

where $r^{(0)} = -\infty$ and $r^{(N-n_0+1)} = +\infty$.

This sample distribution function is then compared to the theoretical
distribution function under $H_0$, i.e., $F(r) = N(0, \sigma^2)$. The test
statistic is defined by

$$D_{N-n_0} = \sup_{\text{all } r} \left| F_{N-n_0}(r) - F(r) \right|$$

that is, $D_{N-n_0}$ is the largest absolute difference between the two
distribution functions. It can be shown that the distribution of $D_{N-n_0}$
is not dependent on the distribution of F(r). Values of the
distribution of $D_{N-n_0}$ are tabulated in most statistics books (see
Ref. 7, Table VI) and rejection values are given based on the significance
level of the test desired (i.e., the probability of a Type I error
allowable). Thus all that remains is to determine $D_{N-n_0}$.

See Ref. 9, page 51.
Ref. 7, page 300.

For our application, this constitutes nothing more than comparing
$F_{N-n_0}(r)$ to F(r) at the end points of the steps in $F_{N-n_0}(r)$. Thus
$D_{N-n_0}$ will be the maximum of the numbers

$$\left| \frac{j-1}{N-n_0} - F(r^{(j)}) \right| \quad \text{and} \quad \left| \frac{j}{N-n_0} - F(r^{(j)}) \right|, \qquad j = 1, \ldots, N-n_0 \qquad (27)$$

The quantities $F(r^{(j)})$ are easy to determine for a given $\sigma^2$. By
using any table of the normal distribution, one merely looks up the value
of the percentile of the Normal distribution function for $r^{(j)}/\sigma$. A
problem arises, however, as to what value to use for $\sigma^2$.


Three candidates are presented and discussed in Appendix III, and
the square of the Standard Error of the Estimate which is obtained in the
usual regression analysis (from the fit on the entire sample N) is selected.
The choice was based on the fact that it was the most efficient estimator
and that, unlike the other candidates, it does not depend directly on the
residuals in r. Furthermore, it is the estimate of the variance that
is normally used in a regression analysis. The estimator is denoted $\hat{\sigma}^2$
and the equation for calculating it is given by:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - (P+1)} \qquad (28)$$

where $y_i$ is the actual cost of the ith procurement,

$\hat{y}_i$ is the estimated cost of the ith procurement (obtained
from a regression analysis of the entire sample),

and P is the number of independent variables in the PER.

The K-S test is then valid as long as we define the null hypothesis,
$H_0$, as "r is a random sample from a $N(0, \hat{\sigma}^2)$ distribution." The test
is not necessarily valid for the wider null hypothesis of $H_0$ defined as
"r is a random sample from $N(0, \sigma^2)$ with $\sigma^2$ estimated by $\hat{\sigma}^2$." There
are indications, however, that if rejection takes place then rejection
would also take place if $\sigma^2$ were known (Ref. 9, page 60). In addition,
Darling, in Ref. 10, has described some conditions under which variations
of the K-S test are valid for the wider hypotheses. How the present
application fits into his work has yet to be determined. Further research
will have to be done on extending the current application to this wider
hypothesis.

The second test proposed in this section addresses the question of
bias in the vector of residuals r. In particular it is a hypothesis test
given by

$H_0$: r is a random sample from a $N(0, \sigma^2)$ population

$H_1$: r is a random sample from a $N(\mu, \sigma^2)$ population

where $\mu \neq 0$.

The test statistic is derived by the use of a likelihood ratio test. The
test statistic has a t-distribution with $N - n_0 - 1$ degrees of freedom and
is given by

$$t = \frac{(N - n_0 - 1)^{1/2}\, \bar{r}}{s_r} \qquad (29)$$

where $\bar{r}$ is the sample mean of r, i.e.,

$$\bar{r} = \frac{1}{N - n_0} \sum_{i=n_0+1}^{N} r_i \qquad (30)$$

and $s_r^2$ is the sample variance given by

$$s_r^2 = \frac{\sum_{i=n_0+1}^{N} (r_i - \bar{r})^2}{N - n_0} \qquad (31)$$

For derivation see Ref. 7, page 320.


The test is conducted as follows:

1. Determine the significance level $\alpha$, i.e., what probability
is the analyst willing to withstand of rejecting $H_0$ when
it is true?

2. From a t-table,* obtain the value of the $\alpha/2$ and $1 - \alpha/2$
percentiles of the t-distribution with $N - n_0 - 1$ degrees of
freedom. Label these $t_{\alpha/2}$ and $t_{1-\alpha/2}$. Note that only one
value need be obtained, as $t_{\alpha/2} = -t_{1-\alpha/2}$.

3. Calculate t from Eq. 29.

4. If $t_{\alpha/2} \leq t \leq t_{1-\alpha/2}$, then $H_1$ is rejected and no apparent
bias is present (at the $\alpha$ significance level).

5. If $t < t_{\alpha/2}$ or if $t > t_{1-\alpha/2}$, then $H_0$ is rejected and
there is significant bias present (at the $\alpha$ significance
level).

Another way of stating this test is to ask the question: is $\bar{r}$
significantly different from zero? If so, $H_0$ should be rejected and bias
is present.


An  example  of  the  use  of  these  tests  is  given  below.  Again,  the 
data  base  used  will  be  the  one  that  was  used  in  the  computer  test  run. 
Values  of  r  are  obtained  from  output  block  7,  Table  20,  Appendix  I. 
These  are  given  in  Table  15. 


* Available in any statistics book, such as Ref. 7.




TABLE 15

ONE-STEP ADJUSTED RESIDUALS FROM TEST RUN*

Sample Point    Sample Predicted From    Adjusted Residual

      6                   5                   -10.63
      7                   6                   -21.71
      8                   7                    25.95
      9                   8                    -8.038
     10                   9                     2.889
     11                  10                   -15.67
     12                  11                    21.02
     13                  12                    37.721

* Source, STAT 1, Output Block 7, Table 20, Appendix I.


To apply the K-S test, we must first order the adjusted residuals
from smallest to largest. Then the residuals are divided by $\hat{\sigma}$, i.e.,
the standard error of the estimate from sample size 13. From the computer
test run, last output block 5 (Table 20, Appendix I), $\hat{\sigma} = 21.6$. By
using tables of the Standard Normal Distribution, these latter quantities
are converted to percentiles of the Standard Normal distribution (equivalent
to obtaining their cumulative distribution function value). These
operations are summarized in Table 16, columns 2-4.


These percentiles are to be compared to the endpoint values of the
steps in the sample distribution function given in Eq. 27. Since there
are eight sample points, the values of the sample distribution function
will jump by one-eighth. The appropriate endpoint values are given in
columns 5 and 6 of Table 16.


The maximum differences between the percentiles (in column 4) and
the endpoints (in columns 5 and 6) are calculated for each sample point.
These are shown in column 7. The K-S statistic for sample size (number
of residuals) 8, $D_8$, is then the maximum value in column 7. In the
example under discussion $D_8 = 0.21$. This value is within acceptable
limits, as $D_8$ would have to be greater than 0.358 for rejection at as
high a significance level (probability of making a Type I error) as 0.2.
Hence the regression model and linear PER cannot be rejected.

TABLE 16

K-S TEST CALCULATIONS

Column (1): Sample Point
Column (2): Ordered Adjusted Residuals
Column (3): Divided by Standard Error of Estimate
Column (4): Percentile of Normal Population
Columns (5), (6): Compare to End Points of Sample Distribution Function
Column (7): Maximum Difference for Point

 (1)      (2)        (3)       (4)      (5)      (6)      (7)

   7    -21.71     -1.005     0.157    0        0.125    0.157
  11    -15.67     -0.725     0.234    0.125    0.250   <0.125
   6    -10.63     -0.492     0.311    0.250    0.375   <0.125
   9     -8.038    -0.372     0.355    0.375    0.500    0.145
  10      2.889     0.134     0.553    0.500    0.625   <0.125
  12     21.02      0.973     0.835    0.625    0.750    0.210
   8     25.95      1.201     0.883    0.750    0.875    0.135
  13     37.721     1.746     0.960    0.875    1.000   <0.125
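The Table 16 calculation can be reproduced in a few lines. The sketch below follows Eq. 27 and uses the residuals of Table 15 with the standard error of the estimate 21.6 quoted in the text; the function names are hypothetical, and the normal distribution function is computed from `math.erf` instead of being read from a table.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic(residuals, sigma):
    """Largest gap between the sample step function and N(0, sigma^2), per Eq. 27."""
    ordered = sorted(residuals)
    n = len(ordered)
    d_max = 0.0
    for j, r in enumerate(ordered, start=1):
        f = normal_cdf(r / sigma)
        # Compare F(r) with both end points of the step at the jth order statistic.
        d_max = max(d_max, abs((j - 1) / n - f), abs(j / n - f))
    return d_max

# Adjusted one-step residuals from Table 15.
table_15 = [-10.63, -21.71, 25.95, -8.038, 2.889, -15.67, 21.02, 37.721]

d8 = ks_statistic(table_15, sigma=21.6)
critical_value = 0.358          # rejection limit quoted in the text (8 residuals, 0.2 level)
reject = d8 > critical_value
```

The computed `d8` rounds to the 0.21 reported in the text, and it comes from sample point 12 as in Table 16.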

The calculation for bias is performed by first calculating the
sample mean, Eq. 30, and sample variance, Eq. 31, for the residuals r
(column 3, Table 15). These calculations resulted in values of $\bar{r} = 3.94$
for the sample mean and $s_r = 20.7$ for the standard deviation.

The t-statistic is then given by Eq. 29 as

$$t = \frac{(N - n_0 - 1)^{1/2}\, \bar{r}}{s_r} \qquad (29)$$

Hence, in our case

$$t = \frac{(7)^{1/2} \times 3.94}{20.7} = 0.504$$


This is not a significant t-value (7 degrees of freedom) for any
reasonable significance level. As an example, if $\alpha = 0.2$, i.e., the
0.2 significance level, then the rejection limits would be ±1.42. The value
of t obtained above is not even close to being outside of this range.

Hence  the  Historical  Simulation  results  do  not  indicate  bias  in  the  model. 
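The bias test can be sketched as follows, assuming Eq. 30 for the mean, Eq. 31 with divisor $N - n_0$ for the variance, and Eq. 29 for t. The function name is hypothetical, and the rejection limit 1.42 is the value quoted in the text and is supplied by hand, not computed. Values computed directly from the printed residuals may differ slightly in the later digits from the 20.7 and 0.504 quoted above; the conclusion is unchanged.

```python
import math

def bias_t_test(r, t_critical):
    """Return (mean, standard deviation, t, reject_H0) for the one-step residuals."""
    n = len(r)                                                   # N - n0 residuals
    r_bar = sum(r) / n                                           # Eq. 30
    s_r = math.sqrt(sum((ri - r_bar) ** 2 for ri in r) / n)      # Eq. 31
    t = math.sqrt(n - 1) * r_bar / s_r                           # Eq. 29
    return r_bar, s_r, t, abs(t) > t_critical

# Adjusted one-step residuals from Table 15.
table_15 = [-10.63, -21.71, 25.95, -8.038, 2.889, -15.67, 21.02, 37.721]

# 1.42 is the two-sided limit quoted in the text for 7 degrees of freedom.
r_bar, s_r, t, biased = bias_t_test(table_15, t_critical=1.42)
```

The t value is roughly 0.5, well inside the ±1.42 limits, so `biased` is false.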


Even though the values of these two statistics are insignificant
for the test run data, there will be times when they are significant, and
yet the usual regression statistics would seem reasonable. To illustrate
this point consider the theoretical example portrayed in Fig. 4.

Figure 4(U). Theoretical Example


In  the  example,  the  candidate  estimating  procedure  is 

Cost  =  a  +  bX 

and a least squares curve-fitting technique is used to pick the parameters.
The time sequencing of the data is the same as an ordering on the values
of X, i.e., larger values came later.


At the first stage of Historical Simulation the first three data
points ($P_1$, $P_2$, and $P_3$) are used to fit a line $\ell_1$. The estimate of
$P_4$ would be low by the amount $R_1$. At the next stage of Historical
Simulation, line $\ell_2$ would be derived using as the data base points
$P_1$, $P_2$, $P_3$ and $P_4$. The estimate of $P_5$ derived from $\ell_2$ would be
low by $R_2$. The process is continued, deriving lines $\ell_3$ from data
points $P_1$ through $P_5$, and $\ell_4$ from $P_1$ through $P_6$. The estimates
of $P_6$ (from $\ell_3$) and $P_7$ (from $\ell_4$) are low by $R_3$ and $R_4$,
respectively. Thus all the predictions obtained were low. This implies
that the K-S statistic and t statistic will most likely be significant,
and hence the estimating procedure would not be accepted by Historical
Simulation.


Looking at each of the lines, however, it does not seem that the
fit (to the data they were derived from) is too bad. In fact, $\ell_4$ would
probably be accepted as a good model for the first six data points, using
statistics based on regression theory. Hence the model would be
accepted using the regression theory statistics while it would be rejected
using Historical Simulation.


Of course, in this case, a simple plot of the data would convince an
analyst that he has the wrong model (it should be exponential rather than
linear). This, however, is a consequence of a two-dimensional problem
(Cost and X) in which plots can be made and our illustration could be
drawn. The analyst will not have the luxury of such plots when working
with more than one independent variable, and an extension of this example
to a multiple independent variable model can readily be made (without a
figure, however).


While the significance of the t-statistic depends on the magnitude of the
residuals and how close together they are, the fact that all residuals
are negative will usually lead to rejection of the zero mean hypothesis.
In regard to the K-S test, all negative residuals implies a K-S
statistic value greater than 0.5. This is significant for four residuals
at the 0.2 significance level, and if the process in the example continues
for seven residuals, the results will be significant at the 0.05 level.

It should be noted that there is another technique, called Time Sequence
Plot  of  the  residuals  (Ref.  (> ,  page  bd)  ,  which  for  the  example  being 
discussed would result in a sequencing of residuals from the usual
regression analysis that would indicate a lack of fit. However, the
consequences of retaining the model (in this example), i.e., the
likelihood of underestimates, are more apparent when processed by Historical
Simulation. Furthermore, even though residual plots should be analyzed
whenever a least squares curve fit is made, the fact is that such
examinations of residuals are often forgotten.

It should be noted that the two tests discussed in this section are
very different. The test for bias assumes that the underlying model is
normal and all that is being tested is if the mean is zero. The
Kolmogorov-Smirnov test, on the other hand, asks whether or not the distribution is
normal, with mean 0 and variance $\hat{\sigma}^2$. Both tests address the question
of model validity, however, as the residuals should theoretically come
from an $N(0, \sigma^2)$ population.

It is expected that other tests can be constructed for the one-step
residuals. In addition to the above and extensions of them to tests applied
to all the residuals (after some orthogonalization), it will be worthwhile
to develop two hypothesis tests where $H_1$ is some other candidate
estimating procedure. If this alternative is also a linear PER, with the
assumed  multiple  linear  regression  model,  then  such  tests  should  be 
relatively  easy  to  construct.  If  the  alternative  is  a  nonlinear  PER,  then 
the  appropriate  statistical  distribution  will  have  to  be  identified  and 
the  distribution  of  the  Historical  Simulation  predictions  and  residuals  will 
have  to  be  derived.  Then  the  question  of  devising  tests  can  be  addressed. 
Needless  to  say,  this  last  group  of  tests  will  take  considerable  effort. 


3.  Comparison  Statistics 


The last set of statistics that have been identified are some of
the usual regression statistics for each of the subsample fits in Historical
Simulation. (They have nothing to do with the prediction and residual
output of Historical Simulation.) These can be used in the usual manner
to see how well the estimating procedure is doing on each of the subsamples.
They also can be directly compared to like statistics on the entire sample.

Example values for the test run can be seen in output block 5, Table 20,
Appendix I.

The first set of measures are best summarized as Measures of Fit.
Two such measures are calculated on each subsample for which parameters
are estimated. These measures are given below:

Standard Error of the Estimate:

$$SEE = \left[ \frac{\sum_{i=1}^{m} (A_i - P_i)^2}{m - k} \right]^{1/2}$$

$R^2$ or Coefficient of Determination:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (A_i - P_i)^2}{\sum_{i=1}^{m} (A_i - \bar{A})^2}$$

where m = subsample size

$A_i$ = actual cost of the ith object

$P_i$ = estimate of the ith object (Fit)

k = number of parameters to be estimated in the PER

and $\bar{A}$ is the average of the $A_i$'s, i.e.,

$$\bar{A} = \frac{1}{m} \sum_{i=1}^{m} A_i$$

These measures are not at all related to the predictions calculated
from the CER that is derived by fitting the curve to the subsample. They
merely describe how good the fit was. In theory, if the process
satisfies the statistical assumptions, $SEE^2$ should be converging to the
true variance $\sigma^2$, and $R^2$ should be converging to

-  a  r 

-  (33) 

_  •> 

-  a  r 
Therefore, as the sample size increases through the Historical Simulation
evaluation, we should see this convergence (although for most cost
applications the number of samples fitted will probably be too small). In
practice it is desirable for $\sigma^2$ to be small, and therefore a good fit
is represented by an SEE close to zero and an $R^2$ close to one.

Another set of fit statistics are the t-statistics for each estimated
coefficient of the linear model. These are the statistics that are
usually used to see if a coefficient is significantly different from
zero.

Even if these coefficients are all significantly different from zero in
a usual regression run (i.e., for the entire sample), it may turn out that
they are not significant for all of the subsample fits processed in
Historical Simulation. It seems reasonable that once the subsample size
is large enough for all to be significant, they should remain
significant. If not, one might begin to question the value of retaining
the independent variable that corresponds to the occasionally significant
coefficient.

Note  also  that  the  fact  that  a  particular  coefficient  is  not 
significant  for  early  data  bases  brings  into  question  the  relevance  of 
that  data  base  to  the  current  prediction  problem.  It  may  be  useful  to 
try  estimating  procedures  which  ignore  this  early  data.  Such  a  procedure 
would  be  one  that  estimates  the  parameters  using,  say,  only  the  last  6 
data  points  in  time. 
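A procedure of the kind just described might be sketched as follows. This is a minimal illustration under stated assumptions, not part of the report's program: it uses a one-variable linear PER fitted by least squares, and the names `fit_line` and `windowed_prediction` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("independent variable has no variation")
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    return a, b


def windowed_prediction(xs, ys, x_new, window=6):
    """Predict at x_new using only the most recent `window` points.

    xs and ys must be ordered in time, oldest first.
    """
    a, b = fit_line(xs[-window:], ys[-window:])
    return a + b * x_new


# Ten points in time: the first four follow no pattern, the last six lie
# exactly on y = 2x + 1 (made-up numbers for illustration).
xs = list(range(1, 11))
ys = [50.0, -3.0, 40.0, 7.0] + [2.0 * x + 1.0 for x in range(5, 11)]

recent_only = windowed_prediction(xs, ys, x_new=11, window=6)
all_points = windowed_prediction(xs, ys, x_new=11, window=len(xs))
print(recent_only, all_points)
```

Because the window is an argument, the same procedure can be entered into Historical Simulation several times with different window lengths and the candidates compared on their simulated predictions.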
In summary, these statistics are useful in seeing how the fit is
improving as the sample size grows. They do not, however, pertain to
the main output of Historical Simulation, i.e., the predictions and
residuals. They should nevertheless shed some light on any anomalies
present in this latter output, and may be useful in suggesting new
candidate estimating procedures.

This concludes the discussion of the uses of the output from
Historical Simulation and the work to date on its development. Next it
seems appropriate to summarize the advantages and current limitations
of Historical Simulation and indicate the direction of possible future
research. These topics are discussed in the next section.
V.  CONCLUSION AND RECOMMENDATIONS FOR FUTURE EFFORT

In concluding this report, it is useful to summarize the
limitations and advantages of Historical Simulation as it is currently
envisioned. The section closes with some recommendations for future
work that should shed further light on some of the limitations noted
and extend the work already completed.

A.  CURRENT  LIMITATIONS 

That  limitations  exist  is  not  always  bad,  as  the  following 
discussion  will  show.  However,  there  are  areas  where  the  development  of 
Historical  Simulation  is  far  from  complete  and  the  attendant  limitations 
are  a  real  problem.  These  limitations,  as  the  author  currently  sees  them, 
are  discussed  below. 

1.  Lack of a Single Way to Interpret the Output

Whether this is really a limitation or not is open to question. It
would certainly be more convenient if one summarization could answer all
our questions about a cost estimating procedure's reliability and
validity. But this type of convenience is not even present in the use
of regression theory, as can be seen from the several statistics that
must be calculated (e.g., $R^2$, standard error of estimate, and prediction
intervals). Furthermore, the lack of such a convenient data summarization
has the effect of forcing the analyst to examine the residuals (Table 9),
something that should be done anyway.

2.  Lack of Ability to Uniquely Specify the Minimum Sample Size, $n_0$,
for Historical Simulation

As discussed in Sec. III C, the specification of a minimum sample
size is not a trivial problem. To be sure, there is a lower bound
(depending on the number of PER parameters) below which the value of $n_0$
cannot be defined, but this lower bound is just a starting point in the
specification of $n_0$.

If too low a value is specified, there may not be enough degrees of
freedom for initial predictions to be very meaningful. On the other hand,
too large a value of $n_0$ greatly diminishes the Historical Simulation
output. Each additional sample point included in the initial subsample
deletes a row from the prediction and residual output matrices. Hence
the analyst must specify $n_0$ to be the smallest number for which the
estimating procedure, if valid, will have enough information from which
to make reasonable predictions.
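The trade-off can be made concrete with a short calculation. The sketch below is illustrative only: the function name `simulation_size` is hypothetical, it assumes the sample grows one point at a time (the report's program can also add points in time batches), and the lower bound of one more point than the number of PER parameters is an assumption read from subroutine DESCP in Table 18, where MSAMS = NIV + 2.

```python
def simulation_size(n_points, n_params, n_min):
    """Return (number of subsample fits, total simulated predictions).

    n_points -- size of the whole data base, ordered in time
    n_params -- number of parameters in the PER
    n_min    -- minimum (initial) subsample size chosen by the analyst
    """
    lower_bound = n_params + 1  # assumed: leaves one degree of freedom
    if n_min < lower_bound:
        raise ValueError(f"n_min must be at least {lower_bound}")
    if n_min >= n_points:
        raise ValueError("n_min must leave at least one point to predict")
    # One row of predictions for each subsample size n_min, ..., n_points - 1.
    n_fits = n_points - n_min
    # The subsample of size n predicts the n_points - n later points.
    n_predictions = n_fits * (n_fits + 1) // 2
    return n_fits, n_predictions


for n_min in (5, 6, 7):
    print(n_min, simulation_size(n_points=13, n_params=3, n_min=n_min))
```

For a 13-point data base and a three-parameter PER, raising the minimum sample size from 5 to 6 removes one row of output (8 fits become 7) and eight of the 36 simulated predictions.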

3.  Loss of Information in the Data Summarizations and Statistics
Derived in Sec. IV

Due to a lack of independence, summarizations suggested in
Secs. IV B and IV C have only made use of one residual calculation for
each sample point, usually the one-step residuals $d_{n+1}^{n}$. Hence a
great deal of information goes unused. Further research should be initiated
to try to incorporate the unused information in the recommended
summarizations and tests. Some nonparametric statistical techniques might
prove useful for the summarizations that do not depend on a particular
estimating procedure, while orthogonalization techniques could be applied
to the residuals that are based on the Linear PER-Least Squares procedures.

4.  Lengthy Output Time Requirements for the Time Share Computer Model

The output time requirements for operation of the computer model on
the GE time sharing service seem undesirably long. Thirty-one minutes
of terminal time were required for the test run, Table 20, Appendix I.

There are no inherent reasons for this. It is probably possible to
write the program or program output format more efficiently. Another
possibility is to convert the program to a non-time-sharing machine with
more efficient output. Since the program has been written in FORTRAN,
this latter course should pose few problems.
5.  Lack of Application in the Development of Actual CERs

The usefulness of Historical Simulation will ultimately be decided
by the analyst. Several ways of using the output have been suggested in
Sec. IV. Their value in selecting between several candidate cost
estimating procedures and in hypothesizing new cost estimating procedure
candidates can only be evaluated through their attempted application.
From this process, it is expected that new uses of the output will be
created and perhaps some of the suggested uses discarded.

Some  examples  of  the  application  of  Historical  Simulation  are  given 
in  Volume  2  of  this  report.  They,  however,  do  not  serve  to  remove  this 
limitation,  as  a  much  greater  exposure  is  required  to  fully  understand 
the  practical  worth  of  Historical  Simulation.  Furthermore  the  author 
lacks  the  necessary  understanding  of  either  the  data  base  or  the  example 
aircraft  programs  to  fully  utilize  the  Historical  Simulation  output. 

6.  Lack of a Precise Understanding as to the Situations for Which
Historical Simulation Will be More Valuable Than Regression
Techniques

Insights into the relationship between these two techniques have
been achieved in Sec. IV C and Appendixes II and III. The fact that
there are situations in which Historical Simulation will be more valuable
is clear (see Fig. 4, page 71). Also it seems clear that Historical
Simulation provides a greater visibility (e.g., the identification of
questionable sample points or the demonstration of successful extrapolations)
than the usual regression techniques, even when the conclusions reached
by the two techniques are the same.

However,  a  precise  understanding  of  all  the  possible  situations  for 
which  one  of  the  techniques  is  more  valuable  will  probably  never  be 
reached.  This  fact  leads  to  the  final  limitation. 
7.  Historical Simulation is Not the Ultimate Answer, Merely Another
Tool

Used in conjunction with such traditional methods as regression
theory, Historical Simulation should improve the quality of our CERs.
Furthermore, no technique, including Historical Simulation, will ever
remove the necessity for the analyst. He is, in fact, an integral part
of the evaluation procedure. He must choose candidate estimating
procedures, examine the output tables, choose loss functions and weighting
schemes, etc. Hence the best that can be done is to provide him with as
many useful tools as possible to best perform his analysis.

B.  ADVANTAGES 

Several  unique  advantages  of  the  Historical  Simulation  procedure  have 
been  identified  throughout  this  report.  These  are  summarized  below. 

1.  Historical Simulation Can Compare a Wider Class of Cost Estimating
Procedures Than the Usual Regression Techniques

Section III demonstrated the ability of Historical Simulation to
evaluate any cost estimating procedure.

2.  Historical Simulation Provides an Easy-to-Communicate Summary Statistic
Useful for Describing the Accuracy of a Prediction

This summary statistic is average proportional (or absolute) error
or one of its weighted forms. While it does not summarize all of the
Historical Simulation output, it does describe how well the cost estimating
procedure would have predicted if it had been used in the past to make
predictions of the now historical data.
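A sketch of how such a summary statistic might be computed is given below. It is an illustration rather than the report's program: the function name `summary_statistics` is hypothetical and the input numbers are made up. The definitions follow the main program listing in Table 17, where the proportional error is (estimate - actual) / actual and each sample point carries a weight (the array NW).

```python
def summary_statistics(actual, predicted, weights=None):
    """Return (average proportional error, bias) of simulated predictions.

    actual    -- actual costs of the predicted sample points
    predicted -- the corresponding simulated predictions
    weights   -- optional weight per point; equal weights if omitted
    """
    if weights is None:
        weights = [1.0] * len(actual)
    total_weight = sum(weights)
    prop_errors = [(p - a) / a for a, p in zip(actual, predicted)]
    # Average proportional error ignores the sign of each error ...
    ape = sum(w * abs(e) for w, e in zip(weights, prop_errors)) / total_weight
    # ... while the bias keeps it, so over- and under-estimates cancel.
    bias = sum(w * e for w, e in zip(weights, prop_errors)) / total_weight
    return ape, bias


# Made-up numbers: one 10 percent overestimate, one 10 percent
# underestimate, and one exact prediction.
ape, bias = summary_statistics([100.0, 200.0, 50.0], [110.0, 180.0, 50.0])
print(f"average proportional error = {ape:.4f}, bias = {bias:.4f}")
```

Reporting the bias alongside the average proportional error separates a procedure that is consistently high or low from one whose errors are merely large.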

3.  Historical Simulation Provides a View Independent of the Usual
Regression Theory Approach

This independent view is a consequence of the fact that Historical
Simulation evaluates the ability of a candidate cost estimating procedure
to predict the future from the past. Historical Simulation does not
depend on how well the candidate cost estimating procedure fits the data.
Consequences of this are

1.  An independent view of CERs derived from stepwise regression

2.  Additional information to help hypothesize a new cost estimating
procedure candidate

3.  Exposure  of  questionable  sample  points  which  do  not  fit  in 
with  the  prior  data  base  in  terms  of  information  content  for 
parameter  estimation  and  in  terms  of  simulated  predictions. 

4.  A  demonstration  of  the  candidate  estimating  procedure's 
ability  to  extrapolate  from  historical  data  to  make  predictions. 

5.  The  possibility  of  uncovering  errors  in  an  estimating 
procedure's  formulation  which  would  not  be  uncovered  by  the 
usual  regression  statistics,  e.g.,  Fig.  4,  page  71. 

C.  RECOMMENDATIONS  FOR  FUTURE  EFFORT 

This  report  has  described  the  work  accomplished  to  date  on  the 
development  of  Historical  Simulation.  It  is  the  author's  opinion  that 
the  procedure  has  been  developed  sufficiently  and  offers  enough  advantages 
for  it  to  be  usefully  applied  by  those  analysts  in  industry  and  government 
involved in the development of CERs.

However, as we have pointed out in this section, there are limitations
that should be examined so that the Historical Simulation procedure
can be more fully developed and hence more meaningfully applied. The
future effort required should proceed along two distinct paths, one
theoretical, the other applied.

On  the  theoretical  side,  three  classes  of  problems  can  be  identified 
for  future  investigation. 

1.  Incorporation of more of the residual output into the suggested
statistics and tests: Examples were discussed in limitation 3.

2.  Determination of the probability of selecting the wrong
estimating procedure when using the Sec. IV B summary statistics
(e.g., average loss) for ranking: Monte Carlo techniques
applied to the usually assumed statistical models for the
candidate estimating procedures might be used for this
investigation.

3.  Determination of the theoretical distribution of the Historical
Simulation residuals for estimating procedures other than
Linear PER-Least Squares procedures: Exponential PERs,
Eq. 2, utilizing a log-linear curve fitting technique are
examples of alternative estimating procedures that should be
explored.


On the applied side, the use of Historical Simulation in the
development of CERs should be encouraged. This work should be carried
out by individual analysts engaged in the development of CERs, for only
they will have the knowledge of their data base and of the physical
makeup of the class of procurements under investigation necessary to
interpret the Historical Simulation output and to hypothesize new cost
estimating procedure candidates. Of course, reporting of the successes,
failures, or extensions of the Historical Simulation procedure which are
discovered in specific applications should also be encouraged.
APPENDIX I

COMPUTER PROGRAM DESCRIPTION: LINEAR
PER-LEAST SQUARES CLASS EXAMPLE


A.  GENERAL  REMARKS 

In this appendix the relationship of the Historical Simulation
procedure to an estimating procedure is described in detail by examining
a computer program developed for Historical Simulation. This program
has been written in FORTRAN for the G.E. Time Sharing Service, MARK I;
the main program is listed in Table 17. In describing this program the
flow diagram of Fig. 5 will be followed. The figure is divided into two
parts. On the left, under the title of Main Program, are those
calculations which are not dependent upon a particular estimating procedure.
To these operations the calculations peculiar to a given estimating
procedure are added, as portrayed in the right side of Fig. 5, under the
heading Estimation Procedure.

In theory, a set of operations should be supplied for each estimating
procedure being tested, but fortunately it appears that these operations
can be more generally written around classes of estimating procedures. As
an example, a set of operations written for estimating procedures which
use the least squares fit technique and a (multivariate) linear PER will
be discussed. The multiple linear regression model is usually assumed
for this class of estimating procedures and the class has been referred
to as the Linear PER-Least Squares Procedures.

In writing the program the operations under Estimation Procedure
were organized into four subroutines. Since this organization will
probably be useful for any estimating procedure (that can be automated),
it is documented here. The subroutine names, appropriate
box numbers from Fig. 5, and table numbers for a complete listing of the
programs are listed in the following table.


UNCLASSIFIED 


[Figure 5. Relationship of Historical Simulation and the Estimation
Procedure (AN-17512). Flow diagram in two columns.
MAIN PROGRAM boxes: set up usable data base; set up initial historical
sample based on minimum sample size; has entire sample been used?;
set up next historical sample (4b); make predictions and store data for
summary statistics; calculate summary statistics.
ESTIMATION PROCEDURE boxes: use technique to calculate parameters of PER;
calculate any desired iteration output statistics valid for technique;
calculate predictive statistics peculiar to technique.
Legend: program passes to next stage; information only is passed;
iteration one sample point at a time unit.]

Subroutine     Operation Number     Program Listing,
Name           (from Figure 5)      Table Number

DESCP          2, 10                18
TECH           5, 6a                19
EST            6b                   18
SST            8b                   18
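The division of labor between the main program and the estimation procedure can be sketched in a few lines of Python. This is an illustration of the organization, not a translation of the FORTRAN: the names `historical_simulation`, `fit`, and `predict` are hypothetical stand-ins for the main program, TECH, and EST; the test statistics of SST are omitted; and the sample is assumed to grow one point at a time rather than in time batches.

```python
def historical_simulation(data, fit, predict, n_min):
    """Run the main loop for one candidate estimating procedure.

    data    -- list of (x, actual) pairs ordered in time, oldest first
    fit     -- fit(subsample) -> params; the role played by TECH
    predict -- predict(params, x) -> estimate; the role played by EST
    n_min   -- minimum sample size (supplied by DESCP in the report)

    Returns one entry per subsample size n: (n, rows), where each row is
    (actual, estimate, difference, proportional error) for a later point.
    """
    if n_min < 1:
        raise ValueError("n_min must be at least 1")
    output = []
    for n in range(n_min, len(data)):
        params = fit(data[:n])              # only the first n points are used
        rows = []
        for x, actual in data[n:]:          # predict every later point
            estimate = predict(params, x)
            difference = estimate - actual
            rows.append((actual, estimate, difference, difference / actual))
        output.append((n, rows))
    return output


# A deliberately simple, non-regression procedure: predict the mean of
# all costs seen so far (made-up data, x values unused).
def mean_fit(subsample):
    return sum(actual for _, actual in subsample) / len(subsample)


def mean_predict(params, x):
    return params


data = [(None, 10.0), (None, 20.0), (None, 30.0)]
output = historical_simulation(data, mean_fit, mean_predict, n_min=1)
for n, rows in output:
    print("SAMPLE SIZE =", n, rows)
```

Because the loop only calls `fit` and `predict`, any procedure that can be automated can be evaluated by supplying a different pair of functions, which mirrors the reason for isolating the estimation procedure in its own subroutines.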

In Table 20, an example is given of the Historical Simulation output
(using a Linear PER-Least Squares procedure). The data do not
represent any real data base and the reader is therefore cautioned about
drawing conclusions. Examples with aircraft and helicopter data are
given in Vol. 2 of this report.

As the program is being discussed, reference will frequently be
made to Tables 17 through 20. The contents of Tables 17 through 19
will be referred to by line number. The contents of Table 20 will be
referenced by output group (numbers 1-7 in the left margin).



TABLE 17

MAIN PROGRAM

HISS

0    $FILE RUNDAT,TOTDAT
1    COMMON IDV(6),NIV,EMU(7),RDATA(42,7),VAL(6),
2    +AINV(7,7),AVG(7),SSE
3    DIMENSION DATA(42,7),REST(42,7),ITBAT(42),NORDE(42),NW(42)
4    EQUIVALENCE (DATA,REST)
9C   SAMPLE INPUT AND TIME BATCHES
12   READ(1),NUMV,NUMS,OUT,NUMT
13   NUMV1=NUMV + 1
14   READ(1),(ITBAT(I),I=1,NUMT)
15   15 READ(1),(NW(I),I=1,NUMS)
17   17 READ(2),((DATA(I,J),J=1,NUMV1),I=1,NUMS)
18   READ(2),NORD
19   IF(NORD - 1) 29
20   READ(2),(NORDE(I),I=1,NUMS)
21   REWIND 2
22   READ(2),((RDATA(I,J),J=1,NUMV1),I=1,NUMS)
25   DO 25, I=1,NUMS
26   DO 25, J=1,NUMV1
27   DATA(NORDE(I),J) = RDATA(I,J)
28   25 RDATA(I,J)=0
29   29 REWIND 2
30   IF (3-OUT) 40
31   PRINT,
33   PRINT," SAMPLE DATA"
37   PRINT 28,
41   28 FORMAT(4HSMP.,9X,9HACT. VAL.,25X,19HVAL. OF INDEP. VAR.
42   +/3HNO.,27X,8HX1,(4,7),9X,8HX2,(5,8),9X,8HX3,(6,9))
53   DO 32,I=1,NUMS
57   32 PRINT 33,I,DATA(I,NUMV+1),(DATA(I,J),J=1,NUMV)
61   33 FORMAT(I3,1X,4F17.2,F36.2,2F17.2,F36.2,2F17.2)
62   OUT=OUT+3
65C  OBTAIN TECHNIQUE DESCRIPTION
69   40 CALL DESCP(MSAMS,NUMV)
70   NIV1=NIV+1
71C  REDUCE DATA FOR THIS TECHNIQUE
85   DO 60,I=1,NUMS
86   RDATA(I,NIV1) = DATA(I,NUMV1)
87   DO 60,J=1,NIV
88   K=IDV(J)
89   60 RDATA(I,J) = DATA(I,K)
101C SET UP FIRST SAMPLE
105  IF(MSAMS-NUMS)85,85
109  PRINT," SAMPLE TOO SMALL"
113  STOP
117  85 NUMS1 = 0
121  DO 90, I=1,NUMT
125  ORCSS=NUMS1+ITBAT(I)
129  IF (ORCSS-MSAMS)90,100,100
133  90 NUMS1=NUMS1+ITBAT(I)
137C SET UP SAMPLES
141  100 DO 400,JA=I,NUMT
143  PRINT; PRINT; PRINT,
145  NUMS1 = NUMS1 + ITBAT(JA)
149  IF(NUMT-JA) 110,120,130
153  110 PRINT," SAMPLE SET UP WRONG"
157  STOP
161  120 PRINT," ENTIRE SAMPLE USED"
177  130 PRINT 136,NUMS1
181  136 FORMAT(14HSAMPLE SIZE = ,I3)
188  X=1.
189  CALL TECH(OUT,NUMS1)
193  PRINT,
197C TEST TO SEE IF DONE
201  IF(NUMT-JA) 110,410
209C SET UP PREDICTION OUTPUT; CALC. STAT.
216  NUMS11=NUMS1+1
217  DO 200,I=NUMS11,NUMS
220  REST(I,5)=RDATA(I,NIV1)
221  DO 167,J=1,NIV
225  167 VAL(J) = RDATA(I,J)
226  REST(I,1)=REST(I,5);REST(I,2)=EST(X);REST(I,6)=REST(I,2)
233  CALL SST(REST(I,5),REST(I,6),NUMS1)
239  REST(I,3)=REST(I,2) - REST(I,1)
240  200 REST(I,4)=REST(I,3) / REST(I,1)
250  IF(5-OUT) 400
251  PRINT,
252  PRINT," PREDICTIONS"
253  PRINT,
254  PRINT 185,
255  DO 400,I=NUMS11,NUMS
256  PRINT 190,(REST(I,J),J=1,6)
257  400 CONTINUE
260  185 FORMAT(6X,6HACTUAL,5X,8HESTIMATE,3X,10HDIFFERENCE,2X,
261  +9HPROP.ERR.,4X,6HSTAT.1,4X,6HSTAT.2)
265  190 FORMAT(6F12.3)
277C FINAL OUTPUT
281  410 PRINT,
285  PRINT," FINAL OUTPUT"
289  NSW = 0; APE = 0.; BIA = 0.; SK1 = 0.; SK2 = 0.
291  PRINT 185,
296  ORCSS1 = ORCSS+1

29  7  DO  4  10  l-ORC.SSl  .SUMS 

101  PRINT  190, (REST!  I  ,  I  ,J-  1  ,M 

103  APE  -  APE  +ABSF(RFST(  1  ,-n*SW<  I  i 

304  415  BIA  -  B 1 A  4  REST!  1  .-l>*NW<  1) 

309  -  10  NSW  -  NSW  +  NW(  1  I 


110 

PRINT 

;  PR1  NT ; 

118 

APK  - 

APK  'NSW  ;  HI  A  ■=  BIA  NSW 

319 

DO  m  J 

7  I  -  0 ROSS  1 , NUMS 

320 

RKD  - 

ABSF(RKST(t %U))  -  APF 

322 

SKI  = 

SKI  +  NW ( I ) *RFD**J 

32  3 

uv 

sk:  =  sk:  nw(d*rfd*m 

325 

PR  IN 

,‘AVF.  PROPORTIONAI  ERROR 

126 

PRINT 

/’BIAS 

327 

PRINT 

, "SKEWNESS 

350 

GO  TO 

'  17 

351 

END 

900 

SUSE 

HISST 

9  16 

SllSF 

HISS  ) 

9  38 

sirs  k 

HI  SSI 

940 

SOPT 

SIZE 



TABLE 18

THREE SUBROUTINES

HISS3

335  SUBROUTINE DESCP(MSAMS,NUMV)
339  COMMON IDV(6),NIV,EMU(7),RDATA(42,7),VAL(6),
340  + AINV(7,7),AVG(7),SSE
347C SPECIFY MINIMUM SAMPLE SIZE AND INDEP. VARIABLE
351  READ(1),NIV
359  MSAMS=NIV+2
363  DO 518 J=1,NIV
366  READ(1),IDV(J)
367  518 IF(NUMV-IDV(J)) 545
375C OUTPUT TECHNIQUE
379  PRINT; PRINT,
387  PRINT," LINEAR PER - LEAST SQUARES"
391  PRINT 535,NIV,(IDV(J),J=1,NIV)
395  535 FORMAT(20HNO. OF INDEP. VAR.= ,I1,3X,18HVARIABLE
399  + NOS. ARE ,9I3)
403  RETURN
407  545 PRINT," UNDEFINED VARIABLE CALL IN DESCP"
411  STOP
415  END

420  FUNCTION EST(X)
424  COMMON IDV(6),NIV,EMU(7),RDATA(42,7),VAL(6),
425  + AINV(7,7),AVG(7),SSE
432  EST= EMU(NIV+1)
436  DO 555, J=1,NIV
440  555 EST= EST+ EMU(J)* VAL(J)
444  RETURN
448  END

580  SUBROUTINE SST(S1,S2, NUMS1)
581  COMMON IDV(6),NIV,EMU(7),RDATA(42,7),VAL(6),
582  + AINV(7,7),AVG(7),SSE
584C TEST OF NEW POINTS
585  SID= (1. + 1./FLOATF(NUMS1))*SSE**2
586  DO 590, I=1,NIV
587  DO 590, J=1,NIV
590  590 SID=SID+(VAL(I)-AVG(I))*AINV(I,J)*(VAL(J)-AVG(J))
591  S1 = SSE*(S2-S1)/SID**.5
594C NO SECOND STATISTIC
596  S2 = 0.
598  RETURN
599  END




TABLE 19

SUBROUTINE TECH

HISST

600  SUBROUTINE TECH(OUT,NUMS1)
601  COMMON IDV(6),NIV,EMU(7),RDATA(42,7),VAL(6),
602  + AINV(7,7),AVG(7),SSE
608  NIV1=NIV+1
610C CALCULATE ARITHMETIC MEANS
615  DO 630,I=1,NIV1
620  AVG(I)=0.
621  DO 625,J=1,NUMS1
625  625 AVG(I)=AVG(I)+RDATA(J,I)
630  630 AVG(I)=AVG(I)/NUMS1
650C CLEAR CROSS PRODUCT MATRIX AND VECTOR
651  DO 655 I=1,NIV1
652  EMU(I)=0.
653  DO 655 J=1,NIV1
655  655 AINV(I,J)=0.
670C FORM CROSS PRODUCT MATRIX AND VECTOR
671  DO 680 I=1,NIV
673  DO 677 J=1,NUMS1
674  EMU(I)=EMU(I)+(RDATA(J,NIV1)-AVG(NIV1))*(RDATA(J,I)-AVG(I))
676  DO 677, K=I,NIV
677  677 AINV(I,K)=AINV(I,K)+(RDATA(J,I)-AVG(I))*(RDATA(J,K)-AVG(K))
678  DO 680,K=I,NIV
680  680 AINV(K,I) = AINV(I,K)
700C INVERT MATRIX
702  CALL MTINV(D,ID)
703  IF (ABSF(D)-.00001)863
705C SET UP ESTIMATOR VECTOR
710  EMU(NIV1)=AVG(NIV1)
712  DO 720,I=1,NIV
720  720 EMU(NIV1) = EMU(NIV1)-AVG(I)*EMU(I)
721  EVAR =0.; EBAR =0.; EEX = 0.
722  IF(-XABSF(OUT-4)) 741
725C ESTIMATE VARIANCE; OUTPUT ESTIMATES
727  PRINT," SAMPLE"
728  PRINT 730,
730  730 FORMAT(5X,6HACTUAL,12X,8HESTIMATE,10X,10HDIFFERENCE,5X,
731  +13HPROPOR. ERROR)
740  740 FORMAT(E13.5,2E19.5,E17.5)
741  741 DO 755,I=1,NUMS1
743  DO 745,J=1,NIV
745  745 VAL(J)=RDATA(I,J)
747  E=EST(D);A=RDATA(I,NIV1);B=E-A;C=B/A
748  IF(-XABSF(OUT-4)) 750
749  PRINT 740, A, E, B, C
750  750 EBAR = EBAR + (A - AVG(NIV1))**2
751  EVAR = EVAR +B**2
755  755 EEX=EEX+(E-AVG(NIV1))**2
760C CALCULATE OUTPUT STATISTICS
780  780 PRINT,
790  SSE=(EVAR/(NUMS1-NIV1))**.5
800  800 FORMAT(20HSTD. ERROR OF EST. =,E15.5,10X,
801  +12HR SQUARED = ,E15.5)
805C SET UP VAR - COV MATRIX
806  DO 810, K=1,NIV
810  810 VAL(K) = 0.
820  DO 825, K=1,NIV1
824  DO 825,J=1,NIV
825  825 VAL(K) = VAL(K)-AINV(K,J)*AVG(J)
827  VAL(NIV1)=1./NUMS1
828  DO 830, K=1,NIV
830  830 VAL(NIV1)=VAL(NIV1) - VAL(K)*AVG(K)
833  DO 840, K=1,NIV
836  DO 837, J=1,NIV
837  837 AINV(K,J) = AINV(K,J)*SSE**2
840  840 AINV(NIV1,K)=VAL(K)*SSE**2;AINV(K,NIV1)=AINV(NIV1,K)
843  AINV(NIV1,NIV1) = VAL(NIV1)*SSE**2
8  -i  t  >  PRIM, 

8-.  7  PRINT,"  BUB  SAMPLE  STATISTICS" 

8-8  PRINT, 

8  .*(  PRINT  800,  SSI  ,1. LX/I. BAR 

830  PRINT, 

831  PRIM  8t>0  ,NUMS  1  -N  1 V 1 
8  DO  83  5,1  =  1  ,N  IV 

8,5  8  ;  5  PRIM  BM  ,  i  ,  I.  ML  <  I  ),  EMU  (  I ) /A  INV(  I  ,  I  )**.  5 

8  .  .  PRINT  8i.2,I.MT IMVI  ) 

8>/  RIll'RN 

8.. 0  8i  >0  FORMA  I  ( SI  I  VAR  l  ABLE,  1 IX  ,'MIPAKAMl.TLR,  15X  ,6HT  TEST ,  14X  ,6HD  .  F . 

8.. 1  Sul  FORMA  T  (  5X  ,12,  i.X  ,  5F20 . 5) 

8t.  2  8»,2  FORMAT  (ShCONSTAN  T  ,  F20 . 3) 

i  8 1  j  prim  ,"di.tlrm=xi:ko" 

8i >  ’  K I.  1  l.TN 

;0  i:.p 




TABLE  20 
EXAMPLE  OUTPUT 


SAMPLE DATA

SMP.   ACT. VAL.           VAL. OF INDEP. VAR.
NO.                 X1,(4,7)    X2,(5,8)    X3,(6,9)

  1       95.00      1996.00      178.00      153.00
  2       31.00       967.00      204.00      144.00
  3       60.00      2414.00      217.00      149.00
  4       82.00      4418.00      201.00      144.00
  5       25.00       852.00      172.00      107.00
  6       67.00      2072.00      215.00      136.00
  7      243.00     10408.00      221.00      177.00
  8       54.00      2643.00      258.00      160.00
  9      112.00      3786.00      211.00      172.00
 10      106.00      3335.00      280.00      203.00
 11      183.00      6374.00      305.00      196.00
 12      156.00      7092.00      294.00      187.00
 13      177.00     10304.00      280.00      167.00

LINEAR PER - LEAST SQUARES

NO. OF INDEP. VAR. = 2   VARIABLE NOS. ARE  1  3


SAMPLE  SIZE  -  5 

SAMPLE 

ACTUAL 

ESTIMATE 

DIFFERENCE 

PROPOR.  IRROR 

•95000E+02 

.679S1E+02 

-  .  2  701 9E+02 

- . 2844 1E+00 

•31000E+02 

.50156F+02 

.  19156E+02 

.61 793F+00 

.60000E+02 

. 69160E+02 

. 9 1 599F+0 1 

.  1  526  7F+00 

. 82000E+02 

■86038S+02 

. -01808+01 

.  492-  1F-01 

.25000E+02 

. 1 9666E+02 

- . 5 1344E+01 

-.2  11 18E+00 

SUB  SAMP I, K 

STAT 1ST  ICS 

STD.  ERROR  OF  EST. 

-  .2..7S5K 

•*-0:  R  sol  arfp  - 

.  h  7448} 

VARIABLE 

PARAMETER 

T  TKST 

!>.  Y.  * 

1 

.01 0-0 

l  .  0849S 

) 

.  7‘»1  7S 

l .0S98h 

CONSTANT 

-  7 

PREDICTIONS 


6 


ACTUAL 

ESTIMATE 

1)1  m  KI  N*  1 

PROP .IKK. 

''  A 

6  7.000 

S').  31  1 

- 1 : . 

-.17. 

-  10 

24  3.000 

1  74.447 

- b  8 . 5 s  \ 

-  .  ;  4 ; 

54.000 

80 . 2  *30 

.  b  .  2  5  0 

.  4  8»> 

112  033 

1 0 1 .  h  lf» 

-10. 

-  .  09  i 

106.000 

121.490 

IS  .W0 

.  W» 

183.000 

147. S4h 

-  IS  *S- 

-  .  19.. 

-  1  ' 

156.000 

14  7.88h 

-8.  1  u 

-  .  0  S  2 

-  i 

177.000 

185 . 44R 

-11.SS2 

-  .  0*5 

-  i 

A I 

.MV 
.000 
MU 
.000 
•  UUU 
.030 
.033 
.033 



TABLE  20  (cont.) 
EXAMPLE  OUTPUT 


SAMPLE  SIZE  » 

ACTUAL 
.85000K+02 
.  U000E+02 
.  80000  E+0 - 

.82000E+02 

.25000E+02 
.h  7000K+02 


ESI  IMA  IE 
.  h<)S  U>E+02 
,518b  IK+02 
.  7O858E+02 
.8H0..8E+02 
.22  3858+02 
.57  )  12  E+0  2 


DIFFERENCE 
-.25485E+02 
.2088 I  E+0  2 
.  1085DE+02 
.  60479E+0 1 
-,2b  3h4  E+0  1 
-,8bb82E+01 


PHOPOR.  ERROR 
-.26806E+00 
. 8  7  300E+00 
. 18098E+00 
.  73754E-0! 
10546K+00 
I4430K+00 


S I'D.  ERROR  OF  EST. 


VARIABLE 

1 


SFB  SAMPLE  STATISTICS 


.211 2-K+02 


ARAMETER 
. 0 l 04u 
.  7b,  70 
-h8.  18  2  17 


R  SQUARED 


T  TEST 
1 . 2S3A  7 
1  .20181 


.8498 1E+00 


PREDICTIONS 


ACTUAL 
2-.  1.000 
5,. 000 
112. 000 
l 0b .000 
18  1. 000 
15»,000 
17,’.  000 


ESTIMATE 
1 7b. 080 
81  .1)72 
102.8  1.1 
121. 808 
1  -.8.  12  I 
1,8 . Sh  7 
1).  7.  15  2 


DI FFKRKNtK 

PROP.KRR. 

S  I  AT . 1 

STA1 . 2 

-hh  .  10 

-21. 710 

.000 

J  ? .  »i  7.’ 

2 

22.808 

.000 

-4.ih; 

-  .oh: 

-b .4  74 

.000 

l  S  . 

.  ivi 

7 .  b2  1 

.000 

-  U  .  »>  ?'• 

-lb. 885 

.000 

; .  v)  n 

-  .0-  » 

-  1. 320 

.000 

.  f.-.M 

-  .yl'n 

-  125 

.000 

\i  1UA1 

,  <  10001  *07 
.  110001  *0. 
•00001*0. 
i.  0001  *0  7 
■  0001  .0. 

•  '0001  *0. 

.  .  1001  *0  I 


t  *.  1  I  MA  II 

!.  88-01  *0  7 
.,75  .1  *07 
'  1585!  *0. 
105821  *0  I 

li,08  '  *0. 

ih  1771  *07 
7  15881  *0  1 


DI  FF1  Rl  M  l 
.2-0,01  -07 

.11.  .,!  -07 
.  1  1  *  *  '  ‘  *0  . 
.718'  71  -07 

.  .880781  *0  1 

8*.  -  ,i  -0  ; 

.  -0  .  .01  -  0  : 


PROI’OR.  ERROR 
-.278  101  +00 
.8275,1 *00 
.  201)581  +00 
.280,31  +00 
-.  Uibl.'l  *00 

-  .  1  78  3  1E+00 
-.7  888  71  -0  1 


8  AMl'l  1  1  \  !  I  s  I  1 1  ' 


'.KAMI  1  •  •> 
-01'+. 
A  ,  ,  , 


•’HI  HUM  »V- 


••  H  AKl  D 


;il  K Ft  K!  \(  !  'K  »!'  .  [  HK  . 
U.--1  ■ 




TABLE  20  (cont.) 
EXAMPLE  OUTPUT 


SAMPLE  SIZE  - 

ACTUAL 
.95000E+02 
.310001+02 
.  60000E+02 
•82000E+02 
.25000E+02 
.67000E+02 
.24  300E+03 
.54000E+02 


SAMPLE 


EST IMATE 
.60224E+02 
. 36792E+02 
.671 I3E+02 
.105011+03 
.  21808F.+02 
. 55887E+02 
.23475E+03 
.75418E+02 


Dll'FERENCE 
-.34776F.+02 
.57921E+01 
.71127E+01 
.23008E+02 
-.31915E+01 
-.1111 3E+02 
-.825051+01 
.21418E+02 


PROPOR.  ERROR 
- .  36606E+00 
.  1R684F.+00 
.  1 1  85511-4-00 
.28059E+00 
-.  12766E+00 
-.16587E+00 
-.339331- -01 
.3966  31+00 


SUli  SAMPLE  STATISTICS 


STD.  ERROR  OF 

F.ST.  - 

. 22286E+02 

VARIABLE 

PARAMETER 

1 

.01977 

2 

.  34353 

CONSTANT 

- 

31.79035 

PREDICTIONS 

ACTUAL 

ESTIMATE 

DIFFERENCE  PROP 

112.000 

102.134 

-9.866 

106.000 

103.869 

-2.131 

183.000 

161.536 

-21.464 

156.000 

1  72.637 

16.637 

177  .000 

229.258 

52.258 

SAMPLE  SIZE  - 

9 

R  SQUARED  = 


T  TEST 
5 . 16064 
.58173 


ERR. 


.107 

.295 


SAMPLE 


STAT . 1 
-8.038 
-1.165 
-14.648 
12.691 
37.294 


•92572E+00 
I) .  F .  =  5 


STAT . 2 


.000 


ACTUAL 

ESTIMATE 

DIFFERENCE 

PROPOR.  ERROR 

95000F.+02 

.622931+02 

- .  32707F.+02 

-.  34429F+00 

31000E+02 

.  18  3481+02 

.  7  3480E+01 

.  2  370311+00 

60000E+02 

.685741+02 

.8574111+01 

.  14290F.+00 

82000E+02 

.105071+0  3 

.230691+02 

.281 33E+00 

2  5000E+02 

.1947  11+02 

-.5526611+01 

-.221  0611+00 

6 7000E+02 

.561121+02 

-.108881+02 

-  .  16251E+00 

,243001+03 

.2  L  7  11  +  0  1 

-.72  7 101. +01 

-  .  299  2  2  tl-Vi  1 

,  54  0001+02 

.  7  79521+02 

.2  395211+02 

.  44  3551+00 

,  1  12001+03 

.105451+0  1 

-  .  6  5h  R  11  +  01 

-  .  58..  761-01 

STB  SAMPLE  MAI  I  M  K'S 

STD.  1RROR  OF  ESI. 

- 

-m).’ 

K  SOl'AKI  l» 

.  .425551+00 

VARIABLE. 

1 

2 

PARAMETER 
.019  3  1 
.-500  3 

i  ns; 

s . ;  .nHl 

.n 

1)  .  1  .  =  6 

CONSTANT 

-45.1  522  3 

PREDICT  IONS 

ACTUAL 
106.000 
18  3.000 
156.000 
177.000 

F.ST  LMATE 
110.682 
166.287 
176.119 
229.218 

DIFFERENCE  PROP  ERR. 

-  .682  .044 

-16.713  -.041 

20.119  .129 

52.218  •  2  9 r- 

STA'I  .  1 
: .  8H« 

I7.26h 

STAT  .  2 
.000 
.000 
.02  0 
.000 

UNCLASSIFIED 


UNCLASSIFIED 


TABLE  20  (cont.) 
EXAMPLE  OUTPUT 


SAMPLE SIZE = 10

SAMPLE

    ACTUAL        ESTIMATE      DIFFERENCE    PROPOR. ERROR
  .95000E+02     .61592E+02    -.33408E+02    -.35166E+00
  .31000E+02     .37856E+02     .68557E+01     .22115E+00
  .60000E+02     .68175E+02     .81745E+01     .13624E+00
  .82000E+02     .10540E+03     .23398E+02     .28534E+00
  .25000E+02     .20817E+02    -.41830E+01    -.16732E+00
  .67000E+02     .56285E+02    -.10715E+02    -.15992E+00
  .24300E+03     .23582E+03    -.71777E+01    -.29538E-01
  .54000E+02     .77053E+02     .23053E+02     .42691E+00
  .11200E+03     .10422E+03    -.77804E+01    -.69468E-01
  .10600E+03     .10778E+03     .17826E+01     .16817E-01

SUB SAMPLE STATISTICS

STD. ERROR OF EST. = .19110E+02     R SQUARED = .92613E+00

VARIABLE     PARAMETER       T TEST      D.F. = 7
1               .01957      7.27017
2               .39968      1.40131
CONSTANT     -38.62356

PREDICTIONS

 ACTUAL    ESTIMATE   DIFFERENCE   PROP.ERR.     STAT.1    STAT.2
183.000     164.464      -18.536       -.101    -15.670      .000
156.000     174.919       18.919        .121     16.232      .000
177.000     229.790       52.790        .298     38.056      .000


SAMPLE SIZE = 11

SAMPLE

    ACTUAL        ESTIMATE      DIFFERENCE    PROPOR. ERROR
  .95000E+02     .62508E+02    -.32492E+02    -.34202E+00
  .31000E+02     .37831E+02     .68314E+01     .22037E+00
  .60000E+02     .68869E+02     .88691E+01     .14782E+00
  .82000E+02     .10615E+03     .24148E+02     .29449E+00
  .25000E+02     .17849E+02    -.71508E+01    -.28603E+00
  .67000E+02     .55878E+02    -.11122E+02    -.16600E+00
  .24300E+03     .24052E+03    -.24813E+01    -.10211E-01
  .54000E+02     .78666E+02     .24666E+02     .45678E+00
  .11200E+03     .10704E+03    -.49643E+01    -.44324E-01
  .10600E+03     .11294E+03     .69421E+01     .65492E-01
  .18300E+03     .16975E+03    -.13247E+02    -.72388E-01

SUB SAMPLE STATISTICS

STD. ERROR OF EST. = .18715E+02     R SQUARED = .93468E+00

VARIABLE     PARAMETER       T TEST      D.F. = 8
1               .01980      7.54797
2               .47853      1.81979
CONSTANT     -50.22009

PREDICTIONS

 ACTUAL    ESTIMATE   DIFFERENCE   PROP.ERR.     STAT.1    STAT.2
156.000     179.660       23.660        .152     21.020      .000
177.000     233.675       56.675        .320     41.525      .000

SAMPLE SIZE = 12

SAMPLE

    ACTUAL        ESTIMATE      DIFFERENCE    PROPOR. ERROR
  .95000E+02     .62046E+02    -.32954E+02    -.34689E+00
  .31000E+02     .38348E+02     .73479E+01     .23703E+00
  .60000E+02     .68246E+02     .82456E+01     .13743E+00
  .82000E+02     .10432E+03     .22315E+02     .27213E+00
  .25000E+02     .19590E+02    -.54095E+01    -.21638E+00
  .67000E+02     .55890E+02    -.11110E+02    -.16582E+00
  .24300E+03     .23358E+03    -.94150E+01    -.38745E-01
  .54000E+02     .77546E+02     .23546E+02     .43604E+00
  .11200E+03     .10477E+03    -.72346E+01    -.64595E-01
  .10600E+03     .11002E+03     .40182E+01     .37908E-01
  .18300E+03     .16498E+03    -.18023E+02    -.98486E-01
  .15600E+03     .17467E+03     .18674E+02     .11970E+00

SUB SAMPLE STATISTICS

STD. ERROR OF EST. = .18985E+02     R SQUARED = .92976E+00

VARIABLE     PARAMETER       T TEST      D.F. = 9
1               .01912      7.18?70
2            [illegible]    1.?870?
CONSTANT     -44.5831?

PREDICTIONS

 ACTUAL    ESTIMATE   DIFFERENCE   PROP.ERR.     STAT.1    STAT.2
177.000     227.122       50.122        .283     37.721      .000


ENTIRE SAMPLE USED

SAMPLE SIZE = 13

SAMPLE

    ACTUAL        ESTIMATE      DIFFERENCE    PROPOR. ERROR
  .95000E+02     .64234E+02    -.30766E+02    -.32386E+00
  .31000E+02     .42177E+02     .11177E+02     .36055E+00
  .60000E+02     .68374E+02     .83743E+01     .13957E+00
  .82000E+02     .97150E+02     .15150E+02     .18475E+00
  .25000E+02     .17056E+02    -.79442E+01    -.31777E+00
  .67000E+02     .54744E+02    -.12256E+02    -.18293E+00
  .24300E+03     .21334E+03    -.29661E+02    -.12206E+00
  .54000E+02     .78946E+02     .24946E+02     .46196E+00
  .11200E+03     .10471E+03    -.72932E+01    -.65118E-01
  .10600E+03     .11704E+03     .11035E+02     .10411E+00
  .18300E+03     .16104E+03    -.21961E+02    -.12001E+00
  .15600E+03     .16681E+03     .10811E+02     .69304E-01
  .17700E+03     .20539E+03     .28388E+02     .16038E+00

SUB SAMPLE STATISTICS

STD. ERROR OF EST. = .21602E+02     R SQUARED = .90936E+00

VARIABLE     PARAMETER       T TEST      D.F. = 10
1               .01593      6.88904
2               .62944      2.22171
CONSTANT     -63.86653

FINAL OUTPUT

 ACTUAL    ESTIMATE   DIFFERENCE   PROP.ERR.     STAT.1    STAT.2
 67.000      55.311      -11.689       -.174    -10.631      .000
243.000     176.090      -66.910       -.275    -21.710      .000
 54.000      85.441       31.441        .582     25.950      .000
112.000     102.134       -9.866       -.088     -8.038      .000
106.000     110.682        4.682        .044      2.889      .000
183.000     164.464      -18.536       -.101    -15.670      .000
156.000     179.660       23.660        .152     21.020      .000
177.000     227.122       50.122        .283     37.721      .000

AVE. PROPORTIONAL ERROR = .2027     BIAS = .0779     SKEWNESS = .1662


B.  PROGRAM  INPUT 

Data  for  the  program  are  stored  in  two  data  files  (for  purposes  of 
compilation  economy)  called  RUNDAT  and  TOTDAT.  TOTDAT  consists  of  the 
data  base,  i.e.,  the  physical  and  performance  characteristics  and  cost 
of  the  historical  procurements.  Each  row  of  the  data  matrix  corresponds 
to  one  procurement  (Table  21).  Each  of  the  procurements,  NUMS  in  number, 
has  associated  with  it  a  cost  and  a  value  for  each  physical  and  performance 
characteristic.  If  there  are  NUMV  characteristics,  then  there  will  be 
NUMV+1  entries  for  each  procurement,  and  hence  there  will  be  NUMV+1  times 
NUMS  numbers  in  the  procurement  data  base. 


TABLE 21
DATA BASE ARRANGEMENT IN TOTDAT

Procurement      Physical or Performance Characteristic Number
Number           1     2     3     4    ....    NUMV     Cost

Oldest  1
        2
        3                      DATA ENTRIES
        4
        .
Newest  NUMS

NUMV: number of Physical and Performance Characteristics
NUMS: number of Procurements


For Historical Simulation, the procurements must be ordered in
time, with the oldest in the first line of data. If this is the order of
the data in TOTDAT, then enter a zero after the data for the last
procurement. This value is used by an indicator variable NORD which leaves
the data base alone when it equals zero.

If, however, the data base has a different order, let the value of
NORD equal one. Follow this by NUMS numbers, one for each procurement,
indicating the transformation necessary to order the data in time.

The data used in the test run are given in Table 22. The data
base is contained in the first 13 lines, 101-113, one for each
procurement. There are four entries in each line, as there are values for
three independent variables and the cost for each procurement. Hence,
NUMS = 13 and NUMV = 3 for the test run. The next entry, line 120,
gives NORD a value of one, hence the data will be reordered. The new
ordering is given in line 121. The first row of data, line 101, will
become row 2, the second line will become row 4 and the fourth line will
become row 1. All other rows will remain the same.
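The reordering rule can be illustrated with a short Python sketch. This is not the program's own code: the function name `reorder_rows` is hypothetical, and the entries of line 121 are read, following the description above, as "input row i becomes row p(i)". The data are the TOTDAT test-run values of Table 22.

```python
def reorder_rows(rows, new_positions):
    """Place input row i at the (1-based) position new_positions[i]."""
    if sorted(new_positions) != list(range(1, len(rows) + 1)):
        raise ValueError("new_positions must be a permutation of 1..NUMS")
    ordered = [None] * len(rows)
    for row, position in zip(rows, new_positions):
        ordered[position - 1] = row
    return ordered


# Test-run data base (lines 101-113): three characteristics, then cost.
TOTDAT = [
    (967, 204, 144, 31), (4418, 201, 144, 82), (2414, 217, 149, 60),
    (1996, 178, 153, 95), (852, 172, 107, 25), (2072, 215, 136, 67),
    (10408, 221, 177, 243), (2643, 258, 160, 54), (3786, 211, 172, 112),
    (3335, 280, 203, 106), (6374, 305, 196, 183), (7092, 294, 187, 156),
    (10304, 280, 167, 177),
]
# Line 121 of the test run (used because NORD = 1).
NEW_POSITIONS = [2, 4, 3, 1, 5, 6, 7, 8, 9, 10, 11, 12, 13]

ordered = reorder_rows(TOTDAT, NEW_POSITIONS)
costs = [row[-1] for row in ordered]
print(costs)
```

Under this reading the reordered costs begin 95, 31, 60, 82, 25, which is the order of the ACTUAL column in the sample blocks of Table 20.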

RUNDAT contains the remaining data arranged as shown in Table 23.
The  first  entries  describe  the  amount  of  data  in  TOTDAT.  They  are  the 
number  of  physical  and  performance  characteristics,  NUMV,  and  the  number 
of  procurements,  NUMS. 

The  next  entry  is  an  output  designator  called  OUT.  The  value 
chosen  will  dictate  the  output  option  for  the  run.  The  options,  together 
with  the  applicable  value  of  OUT,  will  be  described  under  data  output. 

The next entries describe the time groupings of the data in TOTDAT.
The first entry, NUMT, defines the number of time groupings. It is
followed by NUMT numbers (stored in a vector called ITBAT) which tell
how many procurements are in each grouping. The effect of these numbers
is to tell the program how many new procurements to include in the next
sample that is to be given to the evaluation procedure. If the sample
previously used contained the first $n_1$ procurements (from TOTDAT) and
the next number in ITBAT is $n_2$, then the next sample to be processed
will consist of the first $n_1 + n_2$ procurements (from TOTDAT).

TABLE 22
TEST RUN DATA

TOTDAT    13:19    LA"T"    04/20/69

101 967,204,144,31
102 4418,201,144,82
103 2414,217,149,60
104 1996,178,153,95
105 852,172,107,25
106 2072,215,136,67
107 10408,221,177,243
108 2643,258,160,54
109 3786,211,172,112
110 3335,280,203,106
111 6374,305,196,183
112 7092,294,187,156
113 10304,280,167,177
120 1
121 2,4,3,1,5,6,7,8,9,10,11,12,13

RUNDAT    13:18    LA"T"    04/20/69

100 3,13,1,9
102 5,1,1,1,1,1,1,1,1
103 0,0,0,0,0,5,6,7,8,9,10,11,12
104 2,1,3

TABLE 23
RUNDAT DATA

Line Number     Data File Values

1               NUMV, NUMS, OUT, NUMT
2               Vector ITBAT (NUMT entries)
3               Vector NW (NUMS entries)
4               NIV, NIV numbers (characteristic identifiers)
5               Repeat of line 4 for new PER
.
4+K             Repeat of line 4 for final PER

NUMV:   number of characteristics to be considered.
        Must have every characteristic called for by the PERs.
NUMS:   number of procurements in sample
OUT:    output designator
NUMT:   number of time batches
ITBAT:  vector for time grouping observations
NW:     vector for final output weights
K:      number of PERs
NIV:    number of independent variables for a particular PER

The next entries are elements of the vector NW. There is one
entry  for  each  procurement.  These  are  the  weights  that  will  be  assigned 
to  each  residual  for  the  calculation  of  Average  Proportional  Error  and 
the  other  summary  measures  of  bias  and  skewness  (see  Sec.  IV  B).  They 
can  be  integer  weights  as  the  computer  will  divide  by  their  sum. 

The  final  entries  in  RUNDAT  describe  the  PER  to  be  used.  The 
first  entry  corresponds  to  the  number  of  physical  and  performance 
characteristics,  NIV,  and  is  followed  by  NIV  numbers  identifying  the 
specific  characteristics.  For  example,  2,  1,  3  would  indicate  that  the 
PER  consists  of  two  characteristics  and  they  are  numbers  1  and  3.  These 
latter  numbers  will  tell  the  program  which  columns  of  TOTDAT  to  consider. 
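
As a minimal illustration of how a PER line such as 2, 1, 3 selects columns of TOTDAT, consider the following Python sketch. The helper names `parse_per` and `select_columns` are hypothetical and do not appear in the program listing; the row used is the oldest procurement of the test run.

```python
def parse_per(entries):
    """Split a PER line such as [2, 1, 3] into NIV and the identifiers."""
    niv, identifiers = entries[0], list(entries[1:])
    if len(identifiers) != niv:
        raise ValueError("expected NIV characteristic identifiers")
    return niv, identifiers


def select_columns(row, identifiers):
    """Keep the named characteristics (1-based) and the trailing cost."""
    return tuple(row[k - 1] for k in identifiers) + (row[-1],)


niv, identifiers = parse_per([2, 1, 3])
# One TOTDAT row: characteristics 1-3 followed by cost.
reduced = select_columns((1996, 178, 153, 95), identifiers)
print(niv, identifiers, reduced)
```

Characteristic 2 (178) is dropped, leaving characteristics 1 and 3 and the cost.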

Provision  in  the  program  has  been  made  to  evaluate  more  than  one 
PER in each computer run. Each PER must have the line of data just
discussed (i.e., NIV and NIV characteristic identifiers). This is the only
additional  data  needed,  provided  that  all  the  independent  variables  are 
included  in  TOTDAT. 

Test run values for RUNDAT are given in Table 22. In line
100, NUMV = 3, NUMS = 13, OUT = 1, NUMT = 9. The time groupings (vector
ITBAT) are given in line 102. The first subsample will be of size 5, with
one data point being added for each subsequent subsample.

The third line of data contains the weights for each of the residuals.
No weight is given to the first 5, as no prediction of them will be made.
Weights for the remaining points are the subsample size from which the
prediction was made. Thus sample point 6 will have a weight of 5,
point 7 will have a weight of 6, and so forth.

The final line of data indicates that the PER has two independent
variables and they are variables 1 and 3.

Most  of  the  data  are  entered  into  the  program  in  Step  1  (Fig.  5), 
Input  Data  base  (lines  12-18,  Table  17).  This  includes  all  the  data 
with  the  exception  of  NIV  and  the  characteristic  designators.  Ordering 
of  the  data  base,  if  necessary,  takes  place  in  lines  19-28  of  Table  17. 

In  addition,  the  option  to  print  the  sample  data  from  TOTDAT  has  been 
provided  in  Lines  30-62,  Table  17.  The  form  of  this  output  can  be  seen 
in  Table  20  [output  (1)].  If  there  are  more  than  three  independent 
variables,  their  values  will  be  printed  under  the  values  for  XI,  X2  and 
X3  (e.g.,  X4  and  X7  would  appear  under  XI,  etc.). 

NIV  and  the  characteristic  designators  are  read  in  Step  2,  Fig.  5, 
Estimation  Procedure  Description  (lines  351-367,  Table  18).  The  Step  2 
data  define  the  particular  PER  to  be  used.  New  PERs  are  also  defined  in 
Step  2  at  the  end  of  a  loop  from  Step  10. 

There are no limits on the number of PERs that can be evaluated
in a given run. There are, however, upper limits on the number of
procurements, NUMS, and number of characteristics, NUMV. These are
currently programmed at 42 and 6; however, there is a tradeoff between
them. From what I have been able to gather about the MARK I G.E.
Time Sharing System, for which Historical Simulation has been programmed,
all admissible combinations of upper limit values for NUMS and NUMV, for
which any of the possible PER specifications (combinations of any subset
of the NUMV variables) can be run, are given in Table 24. The table is
stopped at NUMV = 12 because NUMV = 13 would yield an upper limit on
NUMS so small that not all 13 variables could be used, since
NUMS > NUMV+1 is required in order to fit the curves with a finite
variance estimate. No advantage would be gained over the case when
NUMV = 12 and it is possible to compile a larger sample (i.e., value
of NUMS).

TABLE 24
POSSIBLE UPPER LIMIT VALUES FOR HISTORICAL SIMULATION
USING LINEAR PER-LEAST SQUARES

If   NUMV =     1     2     3     4     5     6     7     8
Then NUMS ≤   183   115    83    63    51    42    35    30

If   NUMV =     9    10    11    12
Then NUMS ≤    26    22    19    16

C.  CALCULATIONS  AND  PREPARATION  FOR  OUTPUT 

The  actual  program  calculations  are  initiated  in  Step  2,  Estimation 
Procedure  Description  (Fig.  5).  In  addition  to  the  PER  specification, 
discussed  in  the  last  section,  the  minimum  sample  size  is  calculated  in 
this  step  (line  359,  Table  18).  The  minimum  sample  size  required  depends 
on the PER and the technique being tested. For Linear PER-Least Square
procedures  the  minimum  sample  size  equals  the  number  of  independent 
variables  in  the  PER  (NIV)  plus  two  (i.e.,  one  larger  than  the  number  of 
parameters  being  estimated  including  the  constant),  so  that  estimates  of 
variance  are  not  infinite. 

The final task performed in Step 2 is to print out a description
of the estimating procedure being used (lines 375-399, Table 18). The
output block (2), in Table 20, is printed out for Linear PER-Least
Square procedures. In addition to the name, "LINEAR PER-LEAST SQUARES,"
the PER description consisting of number of independent variables
(NIV = 2 in the example) and the characteristic numbers (1 and 3 in
the example) are displayed. This block of output is repeated for each
PER evaluated in the run. If a second PER were evaluated in the test
run, this block of output for the new PER would appear after output
block (7) in Table 20.

The next operation performed by the computer is to Set Up the
Usable Data base (Step 3, Fig. 5). The data matrix entered in Step 1
for TOTDAT is reduced in size by excluding characteristics not included
in the PER defined in Step 2 (lines 73-89, Table 17). For the test run
(Table 20) characteristic 2 is excluded from the rest of the PER
evaluation.

Control is now passed to Step 4a (Fig. 5), in which the Initial
Historical Sample Setup takes place. In lines 101-133, Table 17, data
groupings, defined by the vector ITBAT, are added until the sample size
is greater than or equal to the minimum sample size defined in Step 2.
There may be situations in which the analyst wishes to specify a larger
minimum sample size than the one automatically calculated. This can be
done by making the first entry in ITBAT (see Table 23) the size of the
minimum sample desired.

In the test run, NIV = 2 (line 104, RUNDAT data, Table 22) and the
first entry in ITBAT was 5 (first entry, line 102, same table). If
ITBAT(1) had been 4, then the first subsample would have been equal to
the minimum sample size, NIV+2 = 4. With ITBAT(1) = 5, however, the first
subsample size is 5 [first output block (3), Table 20]. Hence the
sample given to the estimation procedure consists of the first five
procurements of TOTDAT with values for characteristic 1, characteristic 3,
and the actual cost for each procurement. The sample size obtained is
printed out as the first data block (3), Table 20.
The final operation in Step 4a is a housekeeping chore. A loop
is set up for the remaining samples (lines 137-257, Table 17). The loop
initiates with the number of entries in ITBAT used up to achieve the
minimum sample size and is entered as many times as there are entries
left in ITBAT. In the test case, the number of time groupings in ITBAT,
NUMT, equaled 9. The values were 5, 1, 1, 1, 1, 1, 1, 1, 1. One entry
was used in setting up the initial sample. The loop will therefore go
from 2 to 9, resulting in eight more samples.

The new samples are defined in Step 4b (Fig. 5), Set Up Next
Historical Sample, as the loop is reentered (lines 143-181, Table 17).
Observations are added to the sample being passed to the evaluation
procedure by adding the next n observations from the data base (Step 3),
n being defined by ITBAT. For the test run this process results in
sample sizes of 6 through 13 (the total sample for TOTDAT). As each
sample is set up its size is printed out [output block (3), Table 20] to
indicate that the next iteration is being started.
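
The sample-size logic of Steps 4a and 4b can be sketched in a few lines of Python. This is an illustrative reading of the description above, not a transcription of the program; the function name `subsample_sizes` is hypothetical, and the minimum sample size is taken to be NIV + 2 as stated for Linear PER-Least Square procedures.

```python
def subsample_sizes(itbat, niv):
    """Return the successive sample sizes produced by the ITBAT groupings."""
    minimum = niv + 2  # one more than the number of fitted parameters
    sizes, size, used = [], 0, 0
    # Step 4a: add groupings until the minimum sample size is reached.
    while size < minimum:
        if used == len(itbat):
            raise ValueError("not enough procurements for the minimum sample")
        size += itbat[used]
        used += 1
    sizes.append(size)
    # Step 4b: each remaining grouping defines one further sample.
    for batch in itbat[used:]:
        size += batch
        sizes.append(size)
    return sizes


# Test-run values: ITBAT from line 102 of RUNDAT, NIV = 2.
sizes = subsample_sizes([5, 1, 1, 1, 1, 1, 1, 1, 1], niv=2)
print(sizes)
```

For the test run this gives an initial sample of 5 followed by eight more samples of sizes 6 through 13. With groupings of one procurement each, the first sample would instead be the minimum size of 4.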

The sample defined in Step 4a or 4b is now passed to Step 5
(Fig. 5), where the computer Uses the Technique to Calculate the
Parameters of the PER. In the test run, the technique is least squares, and
the following operations are accomplished.

•  Calculate arithmetic means of sample characteristics
and costs (lines 610-630, Table 19)

•  Calculate sample cross product matrix, i.e.,

$\sum_i (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$  (lines 650-680, Table 19)*

•  Invert cross product matrix (line 702, Table 19)**

* This is analogous to the S matrix referred to in Appendix II. The
calculation is different in that the characteristic data are centered.
The difference in the matrices is to prevent round-off errors from
occurring in the computer (see Ref. 6, page 144).

** The program uses a matrix inversion routine that can be obtained from
Ref. 11, program 9.6.
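
The three operations above can be sketched in Python as follows. This is a hedged illustration rather than the program itself: the names `solve` and `fit_linear_per` are hypothetical, and the sketch solves the centered normal equations by elimination instead of explicitly inverting the cross product matrix as the program does. The data are the first eight procurements of the reordered test data base (characteristics 1 and 3, then cost).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear_per(rows):
    """Least squares fit; each row is (x_1, ..., x_n, cost)."""
    m, n = len(rows), len(rows[0]) - 1
    means = [sum(r[j] for r in rows) / m for j in range(n + 1)]
    # Centered cross product matrix of the characteristics.
    s = [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows)
          for k in range(n)] for j in range(n)]
    # Centered cross products of each characteristic with cost.
    c = [sum((r[j] - means[j]) * (r[n] - means[n]) for r in rows)
         for j in range(n)]
    slopes = solve(s, c)
    constant = means[n] - sum(b * xm for b, xm in zip(slopes, means[:n]))
    return slopes, constant


SAMPLE_8 = [
    (1996, 153, 95), (967, 144, 31), (2414, 149, 60), (4418, 144, 82),
    (852, 107, 25), (2072, 136, 67), (10408, 177, 243), (2643, 160, 54),
]
slopes, constant = fit_linear_per(SAMPLE_8)
prediction_9 = constant + slopes[0] * 3786 + slopes[1] * 172
print(slopes, constant, prediction_9)
```

The fitted values can be compared with the sample-size-8 block of Table 20 (parameters .01977 and .34353, constant -31.79035, and an estimate of 102.134 for the ninth procurement).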
On the first iteration the sample being worked on is of size 5
(in our test run). At each succeeding iteration the sample expands to
obtain new PER parameter estimates. These parameter estimates (calculated
in lines 705-720, Table 19) are passed to Step 6b (by use of the common
package) and define the CER for that iteration. In addition, the
arithmetic means (lines 610-630, Table 19) are retained for use in Step 8b
(by the common package).

Control is now passed to Step 6a (Fig. 5), where the machine
Calculates Any Desired Iteration Output Statistics Valid for the Technique.
A minimum output for any technique would be the PER parameter values. For
some techniques this may be all that is desired.

For the Linear PER-Least Squares example being considered, the
following operations are performed:

•  Using the CER defined in 6b (lines 420-448, Table 18), calculate
Fit Data for the sample (lines 721-801, Table 19). This includes
an estimate for each procurement in the sample (5 for the first
iteration), the standard error of the estimate, and an unadjusted
$R^2$ (square of the multiple correlation coefficient).

•  Print out (if desired) for each procurement the actual cost,
estimated cost, cost difference, and proportional cost
differences. This is shown as output block (4) in Table 20 and
executed in lines 728 and 749 of Table 19.

•  Calculate the variance-covariance matrix for the parameters
(lines 806-845, Table 19). Deliver through the common package
to Step 8b for use.

•  Use diagonal elements of variance-covariance matrix to calculate
t-statistics for each of the parameters (line 853, Table 19).

•  In lines 846-855, Table 19, print subsample statistics, data
block (5) in Table 20. These include the standard error of
the estimate, $R^2$, the parameter values, the t-statistics,
and the degrees of freedom.
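
The fit statistics of data block (5) can be illustrated with the sketch below. The formulas are the usual textbook definitions and are an assumption here, since the program listing is not reproduced in this section: the standard error of the estimate is taken as the square root of the residual sum of squares divided by m - (NIV + 1), and the unadjusted $R^2$ as one minus the ratio of residual to total sum of squares. The function name `fit_statistics` is hypothetical; the inputs are the ACTUAL and ESTIMATE columns printed for sample size 8 in Table 20.

```python
import math


def fit_statistics(actual, estimate, niv):
    """Return (standard error of estimate, unadjusted R squared, D.F.)."""
    m = len(actual)
    sse = sum((e - a) ** 2 for a, e in zip(actual, estimate))
    mean = sum(actual) / m
    sst = sum((a - mean) ** 2 for a in actual)
    dof = m - (niv + 1)  # observations minus fitted parameters
    return math.sqrt(sse / dof), 1.0 - sse / sst, dof


ACTUAL = [95.0, 31.0, 60.0, 82.0, 25.0, 67.0, 243.0, 54.0]
ESTIMATE = [60.224, 36.792, 67.113, 105.01, 21.808, 55.887, 234.75, 75.418]

see, r_squared, dof = fit_statistics(ACTUAL, ESTIMATE, niv=2)
print(see, r_squared, dof)
```

Under these assumed definitions the results agree with the printed values for that block (standard error .22286E+02, R squared .92572E+00, D.F. = 5) to within the rounding of the printed estimates.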
Control is now passed to Step 7 (Fig. 5) where, in line 201 of
Table  17,  it  is  decided  whether  a  new  subsample  must  be  processed.  If 
the  entire  sample  has  been  used,  then  control  is  passed  to  Step  9 
(line  281).  If  not,  there  are  procurements  in  the  data  base  (Step  3) 
which  were  not  used  in  estimating  the  parameters.  Control  is  passed  to 
Step  8a  where  the  machine  Makes  Predictions  for  these  procurements  and 
Stores  Data  for  Summary  Statistics.  For  each  procurement  the  following 
steps  occur  (lines  209-265,  Table  17): 

•  Predict the cost of the procurement using the CER in 6b
and  characteristics  from  the  data  base  in  Step  3. 

•  Record  actual  cost,  cost  difference  (from  predicted), 
and  proportional  cost  difference. 

•  Calculate  any  special  statistics  (line  233)  from  the 
technique  using  8b  (described  below). 

•  Print  prediction  statistics,  if  desired.  This  is  output 
block  (6)  in  Table  20. 

•  Check  to  see  if  procurements  will  be  included  in  next 
sample.  If  they  will  be,  store  the  values  calculated 
above  for  the  summary  output,  data  block  (7). 
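
A minimal sketch of the per-procurement prediction record of Step 8a follows. The function name `predict` is hypothetical. The parameters are the rounded values printed for sample size 11 in Table 20, so the result matches the printed prediction only to within that rounding.

```python
def predict(slopes, constant, characteristics, actual_cost):
    """Return predicted cost, cost difference, and proportional difference."""
    predicted = constant + sum(b * x for b, x in zip(slopes, characteristics))
    difference = predicted - actual_cost
    return predicted, difference, difference / actual_cost


# CER from the sample of size 11; procurement 12 has characteristics
# 1 and 3 equal to 7092 and 187 and an actual cost of 156.
predicted, difference, proportional = predict(
    (0.01980, 0.47853), -50.22009, (7092, 187), 156.0
)
print(predicted, difference, proportional)
```

Table 20 prints 179.660, 23.660, and .152 for this prediction.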

The special statistics referred to above are calculated in Step 8b,
Fig. 5 (lines 580-598, Table 18), Calculate Predictive Statistics
Peculiar to Technique. For the Linear PER-Least Square Procedures,
STAT 1 is the value of the t-distribution for the difference between the
actual and predicted procurement cost times the standard deviation of
the process, $\sigma$. Space has been left for a STAT 2 which is not
presently used. The arithmetic means of the characteristics, calculated in
Step 5, and the parameter variance-covariance matrix, calculated in
Step 6a, are used to calculate the t-statistic for each procurement,
together with the procurement information from Step 8a listed below:

•  The predicted cost

•  The actual cost

•  The characteristic values

The mathematical equations used to calculate these statistics are
similar to those used to calculate prediction intervals. Given a new
procurement, with characteristics $(x_1, \ldots, x_n)$, an actual cost $A$,
a sample with $m$ observations, $(X_{i1}, X_{i2}, \ldots, X_{in})$,
$i = 1, 2, \ldots, m$, and a PER which contains $n$ independent variables
$(X_1, \ldots, X_n)$, then

$$\frac{P - A}{SEE_m \left[ 1 + \frac{1}{m} + D' S^{-1} D \right]^{1/2}} \qquad (34)$$

has the t-distribution with $m - (n+1)$ degrees of freedom. In the above
expression we have the following definitions:

$S^{-1}$ = $n \times n$ inverse of the covariance matrix of the sample
values of $X_{ki} - \bar{X}_i$ and $X_{kj} - \bar{X}_j$

$D$ = n-dimensional column vector of terms $x_i - \bar{X}_i$

$\bar{X}_i$ = the arithmetic mean of the sample values of $X_i$

$x_i$ = value of ith independent variable for the new procurement

$SEE_m$ = the standard error of the estimate for sample size $m$

$P$ = predicted value for the new procurement.

The quantity calculated in the computer program is Eq. 34 times
$SEE_m$. This is the output that is used to calculate the statistics
discussed in Sec. IV C.

Reference 3, page 20.
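
As a numerical check on this expression, the sketch below computes Eq. 34 times $SEE_m$, i.e. $(P - A) / [1 + 1/m + D' S^{-1} D]^{1/2}$, for two predictions made from the sample of size 8. It is an illustration under stated assumptions, not the program's code: the names `solve` and `stat1` are hypothetical, and $S$ is taken to be the centered cross product matrix formed in Step 5, which is the reading that reproduces the STAT.1 values printed in Table 20.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def stat1(predicted, actual, x_new, sample_x):
    """(P - A) / sqrt(1 + 1/m + D' S^-1 D) for one new procurement."""
    m, n = len(sample_x), len(x_new)
    means = [sum(row[j] for row in sample_x) / m for j in range(n)]
    # Assumed S: centered cross product matrix of the sample characteristics.
    s = [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in sample_x)
          for k in range(n)] for j in range(n)]
    d = [x_new[j] - means[j] for j in range(n)]
    quadratic_form = sum(di * zi for di, zi in zip(d, solve(s, d)))
    return (predicted - actual) / math.sqrt(1.0 + 1.0 / m + quadratic_form)


# Characteristics 1 and 3 of the first eight (reordered) procurements.
SAMPLE_X = [
    (1996, 153), (967, 144), (2414, 149), (4418, 144),
    (852, 107), (2072, 136), (10408, 177), (2643, 160),
]
stat_9 = stat1(102.134, 112.0, (3786, 172), SAMPLE_X)
stat_10 = stat1(103.869, 106.0, (3335, 203), SAMPLE_X)
print(stat_9, stat_10)
```

The corresponding STAT.1 entries in the sample-size-8 prediction block of Table 20 are -8.038 and -1.165.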
At this point (line 257, Table 17), control passes to Step 4b
where the next historical sample is set up (as previously discussed).
Steps 5-7 are then repeated for the new sample. This process is
continued until the entire data base (Step 3) is used. Since there are
then no procurements to predict, control is passed to Step 9 (line 277,
Table 17), Calculate Summary Statistics.

One value of each of the prediction outputs (data block 6,
Table 20) has been saved for each procurement predicted. In the test
run this would exclude procurements 1-5, as they were included in all
the subsamples and hence never predicted. The values retained are those
generated by the largest subsample used in the prediction of the
particular procurement. These data are printed out in output block 7,
Table 20 (line 301, Table 17). In the test run the procurements printed
out were 6-13. Object 6 was estimated with five procurements in the
sample, object 7 with six in the sample, and so forth.
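
The retention rule can be sketched as follows. The function name `retain_latest` is hypothetical, and only the predictions printed for sample sizes 8 through 12 in Table 20 are used, keyed by procurement number.

```python
def retain_latest(predictions_by_sample_size):
    """Keep, for each procurement, the prediction from the largest subsample."""
    retained = {}
    for size in sorted(predictions_by_sample_size):
        # Later (larger) subsamples overwrite earlier predictions.
        retained.update(predictions_by_sample_size[size])
    return retained


# Predicted costs from the PREDICTIONS blocks of Table 20.
PREDICTIONS = {
    8: {9: 102.134, 10: 103.869, 11: 161.536, 12: 172.637, 13: 229.258},
    9: {10: 110.682, 11: 166.287, 12: 176.119, 13: 229.218},
    10: {11: 164.464, 12: 174.919, 13: 229.790},
    11: {12: 179.660, 13: 233.675},
    12: {13: 227.122},
}
retained = retain_latest(PREDICTIONS)
print(retained)
```

The retained values for procurements 9 through 13 are the estimates listed in the FINAL OUTPUT block of Table 20.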

These data are also used to calculate summary statistics (lines
289-327, Table 17). At present these include the average proportional
error, a measure of bias, and a measure of skewness. The weights in
NW(I) from RUNDAT are used in these calculations. In the test cases,
weights equal to the subsample size are used. Thus procurement 6 receives
a weight of 5, procurement 7 a weight of 6, and so forth. For a
discussion of these calculations, see Sec. IV B.
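
The weighting can be illustrated with the sketch below. The formulas are assumptions, since Sec. IV B is not reproduced here: the average proportional error is taken as the weighted mean of the absolute proportional errors and the bias as the weighted mean of the signed proportional errors. The function name `summary_statistics` is hypothetical, and skewness is omitted because its definition is not given in this section. The inputs are the ACTUAL and ESTIMATE columns of the FINAL OUTPUT block of Table 20.

```python
def summary_statistics(actual, estimate, weights):
    """Weighted average proportional error and bias (assumed definitions)."""
    errors = [(e - a) / a for a, e in zip(actual, estimate)]
    total = sum(weights)  # integer weights are normalized by their sum
    average = sum(w * abs(err) for w, err in zip(weights, errors)) / total
    bias = sum(w * err for w, err in zip(weights, errors)) / total
    return average, bias


ACTUAL = [67.0, 243.0, 54.0, 112.0, 106.0, 183.0, 156.0, 177.0]
ESTIMATE = [55.311, 176.090, 85.441, 102.134, 110.682, 164.464, 179.660, 227.122]
WEIGHTS = [5, 6, 7, 8, 9, 10, 11, 12]  # NW entries for procurements 6-13

average, bias = summary_statistics(ACTUAL, ESTIMATE, WEIGHTS)
print(round(average, 4), round(bias, 4))
```

With these assumed definitions the results agree with the printed summary values of .2027 and .0779.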

Control  now  passes  to  the  final  step  of  the  program,  Step  10,  where 
it  is  determined  whether  a  new  PER  is  to  be  evaluated.  If  not,  the  run 
ends.  If  there  is  a  new  PER,  as  given  by  a  new  value  for  NIV  and  new 
characteristic  numbers  (Table  23),  control  is  passed  to  Step  2  for  new 
PER  definition.  In  the  test  run  there  was  no  new  PER  defined,  so  the 
program  terminated. 
D.  PROGRAM OUTPUT

The preceding paragraphs have described the complete output
available for the program. This output, Table 20, can be divided into two
classes: those printed in steps listed under the Estimation Procedure
(Fig. 5) and those printed in the Main Program. Output data blocks (1),
(3), (6), and (7) fall into this latter category. These blocks can be
printed out no matter what estimation procedure is being tested. Their
form will not change with different estimating procedures.

Data blocks (2), (4), and (5) are printed out in steps listed under
the estimation procedure. Their form and content will change depending
on the estimation procedure (or class of procedures) being tested.

The  output  of  some  of  the  data  blocks  is  optional.  The  output 
designator,  called  OUT,  is  used  to  tell  the  machine  what  output  to  print. 
The  following  options  are  available: 


Output Options (Excluded Blocks Checked)

Values     Independent of               Valid for Particular
of OUT     Estimating Procedures        Estimating Procedures
           (1)   (3)   (6)   (7)        (2)   (4)   (5)

[table entries illegible in the source copy]


Deleting  the  output  of  data  blocks  when  they  are  not  needed  saves 
machine  time  and  reduces  the  complexity  of  the  output.  The  amount  of 
output  is  greatly  reduced  when  OUT  =  6,  as  can  be  seen  by  visualizing 
Table  20  without  data  blocks  (1),  (4),  and  (6). 


It should be noted that in any run with more than one PER and a
value of OUT from 1 to 3, the sample [data block (1)] will be printed out
only one time. The number 3 will be added to the value of OUT to prevent
useless repetition of data block (1), which does not change with
new PERs in a given run.

This  concludes  the  presentation  of  the  program  currently  available 
for  Linear  PER-Least  Square  Procedures.  It  should  again  be  pointed  out 
that  operations  on  the  left  side  of  Fig.  5  are  not  dependent  on  the 
estimating  procedure  being  evaluated.  As  subroutines  for  other  classes 
of  estimating  procedures  are  developed,  such  as  log-linear  PERs,  they 
will  be  tied  into  the  operations  on  the  left,  the  main  program. 

The program described in this appendix is operational on the G.E.
Time-Sharing Service, MARK I. It therefore can be run on any terminal
having that service. This flexibility tends to be offset by the slowness
of the output via the teletype. Thirty-one minutes of console time was
required for the last run.

Another  drawback  of  working  with  the  time-sharing  service  is  the 
space  limitation.  As  can  be  seen  in  Table  24,  the  size  of  the  possible 
data  base  is  not  large,  although  for  the  cost  application  it  seems 
adequate.  Additions  to  the  program  cannot  be  made,  however,  without 
seriously  diminishing  this  data  base. 

No attempt has been made to clean up the program to obtain greater
efficiency. It is to be expected that improvements in operation time
and space can be made by making the program or its output more efficient.

An alternative to this approach is to convert the program to a
non-time-sharing machine; the program has been written in FORTRAN, which
should make conversion reasonably easy. There would certainly be savings
in terminal time, although turn-around time will probably be longer
(i.e., overnight service).

These suggestions for program improvement have not been implemented
to date, as the program without any changes is adequate for demonstrating
the Historical Simulation procedure, and this was the purpose for which
it was written.

APPENDIX II

DISTRIBUTION  OF  HISTORICAL  SIMULATION  PREDICTORS 
AND  RESIDUALS  UNDER  THE  USUAL 
MULTIPLE  LINEAR  REGRESSION  MODEL  ASSUMPTIONS 


In  this  appendix  it  is  assumed  that  the  usual  multiple  linear 
regression  model  holds.  From  this  assumption  the  distributions  of  the 
predictions  and  residuals  obtained  from  Historical  Simulation  can  be 
developed. The reader is cautioned to keep in mind that these results
are  only  valid  when  the  multiple  linear  regression  model  assumptions 
are  valid. 

This  appendix  is  divided  into  five  parts  or  sections:  The  first 
establishes  the  multiple  linear  regression  model  assumptions;  the 
second  develops  the  Historical  Simulation  procedure  (in  the  required 
notation);  the  third  derives  the  distribution  of  predictions  and  residuals 
when  only  one  subsample  is  used;  the  fourth  derives  similar  distributions 
from  two  subsamples;  and  the  last  section  summarizes  the  results. 

In general, the predictions (residuals) are normally distributed
and correlated. There are, however, some residuals which have zero
covariance, and this fact, together with normality, implies that they are
independent. These are the one-step residuals, i.e., those residuals
obtained from making the prediction of the next data point in time.

Before describing the multiple linear regression assumptions, it
will be useful to discuss some of the mechanics of the statistical
operators E for expected value and M for covariance matrix. An
understanding of their use in matrix operations is a prerequisite to the
understanding of this appendix.


For a more complete discussion see Ref. 12, Sec. 2.4.

The expected value operator E is the easier of the two. Let
$U$, $V$, and $W$ be random vectors; $A$ and $B$ nonrandom transformation
matrices; and $C$ a vector of constants. Then the following relationship
defines a set of linear equations:

$$U = AV + BW + C$$

In this situation the expected value of $U$ is given by

$$E(U) = A\,E(V) + B\,E(W) + C$$

The covariance operator is a little harder to understand. Notationally
it will be used in the three ways defined below:

1. Let $U$ be a random vector (column); then its covariance
matrix is given by

$$M_U = E\,[U - E(U)]\,[U - E(U)]'$$

where $[\ ]'$ stands for the transpose of $[\ ]$.

2. Let $U$ and $V$ be two random vectors; then the covariance
matrix between $U$ and $V$ is given by

$$M_{U,V} = E\,[U - E(U)]\,[V - E(V)]'$$

Note: $M_{V,U} = M_{U,V}'$.

3. Let $U$ and $V$ be two random vectors; then their joint
covariance matrix is given by

$$M_{\begin{bmatrix} U \\ V \end{bmatrix}} = \begin{bmatrix} M_U & M_{U,V} \\ M_{V,U} & M_V \end{bmatrix}$$

From these definitions it can be shown that if $R$, $U$, $V$, and
$W$ are random vectors, and $A$ is a nonrandom transformation matrix, and
if

$$U = AV$$

and

$$R = V - W$$

we have

$$M_U = A\,M_V\,A'$$

and

$$M_R = M_V + M_W - M_{V,W} - M_{W,V}$$

In addition, we have

$$M_{\begin{bmatrix} U \\ V \end{bmatrix},\,W} = \begin{bmatrix} M_{U,W} \\ M_{V,W} \end{bmatrix}$$

and

$$M_{\begin{bmatrix} U \\ V \end{bmatrix},\,\begin{bmatrix} R \\ W \end{bmatrix}} = \begin{bmatrix} M_{U,R} & M_{U,W} \\ M_{V,R} & M_{V,W} \end{bmatrix}$$

With an understanding of these operations in hand, we are ready
to set out the multiple linear regression model assumptions.


A.  MULTIPLE  LINEAR  REGRESSION  ASSUMPTIONS 

The assumed sample consists of N P+1-tuples given by
$(y_i, x_{i1}, x_{i2}, \ldots, x_{iP})$ for $i = 1, 2, \ldots, N$. For the
Historical Simulation application, the P+1-tuples have been time ordered.

See pages 384-388 in Ref. 7 for a more complete discussion of these
assumptions.

The usual multiple linear regression hypotheses are given by

$$Y = X\beta + \epsilon \qquad (35)$$

where

$Y$ is an $N \times 1$ column vector whose transpose is given by

$$Y' = (y_1, y_2, \ldots, y_N) \qquad (36)$$

$X$ is an $N \times (P+1)$ matrix given by

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1P} \\ 1 & x_{21} & x_{22} & \cdots & x_{2P} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{NP} \end{bmatrix} \qquad (37)$$

$\beta$ is a $(P+1) \times 1$ column vector given by

$$\beta' = (\beta_0, \beta_1, \ldots, \beta_P) \qquad (38)$$

$\epsilon$ is an $N \times 1$ column vector given by

$$\epsilon' = (\epsilon_1, \epsilon_2, \ldots, \epsilon_N) \qquad (39)$$

The matrix $X$ and the vector $\beta$ are assumed to be nonrandom, while
$\epsilon$ has a multivariate normal distribution with mean vector equal to
zero, a constant variance $\sigma^2$, and zero covariances. Hence

$$\epsilon \overset{d}{=} N(0, \sigma^2 I_N) \qquad (40)$$

where $I_N$ is an N-dimensional identity matrix.

(The above notation is shorthand for "$\epsilon$ is normally distributed with
$E(\epsilon) = 0$ and $M_\epsilon = \sigma^2 I_N$.")

Note that the $i$th row of the $X$ matrix is made up of the $i$th
P+1-tuple of the sample with a 1 replacing $y_i$. The 1 is used as
the multiplier of the constant term $\beta_0$ in the regression equation.
Defining

$$x_i' = (1, x_{i1}, x_{i2}, \ldots, x_{iP}), \quad i = 1, 2, \ldots, N \qquad (41)$$

we have for $i = 1, 2, \ldots, N$

$$y_i = \epsilon_i + \beta_0 + \sum_{j=1}^{P} \beta_j x_{ij} = \epsilon_i + x_i'\beta \qquad (42)$$

Hence $Y$ is a linear combination of the normal random variables
in $\epsilon$, and therefore it follows that $Y$ is also a normally distributed
random vector. The distribution can be derived from Eq. 40 and is given
by

$$Y \overset{d}{=} N(X\beta, \sigma^2 I_N) \qquad (43)$$

i.e., $E(Y) = X\beta$, and

$$M_Y = \sigma^2 I_N$$




B. HISTORICAL SIMULATION PROCEDURE

The regression assumptions fit into Historical Simulation as
described in the following paragraphs.

Let $n_o$ be the minimum sample size for Historical Simulation. It
is necessary that $n_o$ be greater than or equal to the smallest sample
size necessary to carry out a linear regression analysis. Hence,
$n_o \ge P+1$. For any $n$, $n_o \le n < N$, define the following partition
of the $X$ matrix (defined in Eq. 37) by

$$X = \begin{bmatrix} X_1^{(n)} \\ X_2^{(n)} \end{bmatrix} \begin{matrix} n \text{ rows} \\ N-n \text{ rows} \end{matrix} \qquad (44)$$

Also partition $Y$ (defined in Eq. 35) in a similar manner, obtaining

$$Y = \begin{bmatrix} Y_1^{(n)} \\ Y_2^{(n)} \end{bmatrix} \begin{matrix} n \text{ entries} \\ N-n \text{ entries} \end{matrix} \qquad (45)$$

Note that for this partition we have that $Y_1^{(n)}$ and $Y_2^{(n)}$ are
independent, a consequence of Eq. 43. Furthermore, the joint covariance
matrix breaks up as follows. Since

$$M_Y = \begin{bmatrix} \sigma^2 I_n & 0 \\ 0 & \sigma^2 I_{N-n} \end{bmatrix} = \begin{bmatrix} M_{Y_1^{(n)}} & M_{Y_1^{(n)},Y_2^{(n)}} \\ M_{Y_2^{(n)},Y_1^{(n)}} & M_{Y_2^{(n)}} \end{bmatrix}$$

(with column blocks of n and N-n columns and row blocks of n and N-n rows),
we have

$$M_{Y_1^{(n)}} = \sigma^2 I_n, \qquad M_{Y_2^{(n)}} = \sigma^2 I_{N-n}, \qquad M_{Y_1^{(n)},Y_2^{(n)}} = M_{Y_2^{(n)},Y_1^{(n)}}' = 0 \qquad (46)$$

If time batches are ignored, the Historical Simulation procedure
(for the multiple linear regression model) can be defined as follows:

For each $n$, $n_o \le n < N$:

1. Make a least squares fit using $Y_1^{(n)}$ and $X_1^{(n)}$.

2. Obtain an estimating vector of $\beta$. Denote this vector as $B^{(n)}$.

3. Use the resulting fit to make predictions of the remaining
N-n data points. This can be denoted by

$$\hat{Y}^{(n)} = \begin{bmatrix} \hat{y}_{n+1}^{(n)} \\ \hat{y}_{n+2}^{(n)} \\ \vdots \\ \hat{y}_{N}^{(n)} \end{bmatrix} = X_2^{(n)} B^{(n)} \qquad (47)$$

where $\hat{y}_{n+k}^{(n)}$ is the prediction of $y_{n+k}$ using a sample of size $n$.

4. Calculate the residuals by

$$D^{(n)} = \begin{bmatrix} d_{n+1}^{(n)} \\ d_{n+2}^{(n)} \\ \vdots \\ d_{N}^{(n)} \end{bmatrix} = \begin{bmatrix} \hat{y}_{n+1}^{(n)} - y_{n+1} \\ \hat{y}_{n+2}^{(n)} - y_{n+2} \\ \vdots \\ \hat{y}_{N}^{(n)} - y_{N} \end{bmatrix} = \hat{Y}^{(n)} - Y_2^{(n)} \qquad (48)$$

where $d_{n+k}^{(n)}$ denotes the difference (residual) between the predicted
$\hat{y}_{n+k}^{(n)}$ and the actual $y_{n+k}$.

After the Historical Simulation procedure is completed,

$$\sum_{n=n_o}^{N-1} (N-n)$$

predictions and the same number of residuals will have been calculated.
These are denoted by the random vectors

$$\left[\hat{Y}^{(n_o)}, \hat{Y}^{(n_o+1)}, \ldots, \hat{Y}^{(N-1)}\right]$$

and

$$\left[D^{(n_o)}, D^{(n_o+1)}, \ldots, D^{(N-1)}\right]$$

respectively. The problem is to find the distribution of these random
vectors.
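The procedure above can be sketched in a few lines of code. The following is only an illustration under stated assumptions, not the FORTRAN program described in Appendix I: it assumes a single independent variable (P = 1) so that the least squares fit has a closed form, and the names `fit_line` and `historical_simulation` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x (P = 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def historical_simulation(xs, ys, n_min):
    """Return {(n, t): residual}, where the fit uses the first n points
    (time ordered) and t > n is the 1-based index of the predicted point.
    The residual is predicted minus actual, as in Eq. 48."""
    total = len(ys)
    residuals = {}
    for n in range(n_min, total):           # n = n_min, ..., N-1
        b0, b1 = fit_line(xs[:n], ys[:n])   # steps 1 and 2
        for t in range(n + 1, total + 1):   # steps 3 and 4
            predicted = b0 + b1 * xs[t - 1]
            residuals[(n, t)] = predicted - ys[t - 1]
    return residuals
```

With N = 6 and a minimum sample size of 2, the loop produces 4 + 3 + 2 + 1 = 10 residuals, in agreement with the count given above.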


C. DISTRIBUTION OF $\hat{Y}^{(n)}$ AND $D^{(n)}$, i.e., THE PREDICTIONS
AND THE RESIDUALS, FROM ONE SAMPLE SIZE n

From Eq. 47 it can be seen that the form and distribution of $B^{(n)}$
must be established before the distribution of $\hat{Y}^{(n)}$ and $D^{(n)}$ can be
developed. The distribution of $B^{(n)}$ is well known, and hence only
the results and a sketch of the reasoning will be presented here. Let

$$S^{(n)} = {X_1^{(n)}}' X_1^{(n)} \qquad (49)$$

where ${X_1^{(n)}}'$ is the transpose of $X_1^{(n)}$. Then the least squares
solution for the parameter vector $\beta$ is given by

$$B^{(n)} = {S^{(n)}}^{-1} {X_1^{(n)}}' Y_1^{(n)} \qquad (50)$$

where ${S^{(n)}}^{-1}$ is the inverse of $S^{(n)}$.

Now, the only random variables in Eq. 50 are $Y_1^{(n)}$. Hence $B^{(n)}$
is a linear combination of the normal random variables $y_1, y_2, \ldots, y_n$
and hence is normally distributed. The expected value and the
variance-covariance matrix are given as follows:

$$E\,B^{(n)} = \beta$$

and

$$M_{B^{(n)}} = \sigma^2 {S^{(n)}}^{-1}$$

Hence, we have that

$$B^{(n)} \overset{d}{=} N\!\left(\beta, \sigma^2 {S^{(n)}}^{-1}\right) \qquad (51)$$

Furthermore, since ${S^{(n)}}^{-1}$ is essentially a covariance matrix,
it is symmetric; therefore

$${S^{(n)}}^{-1} = \left({S^{(n)}}^{-1}\right)' \qquad (52)$$

Now, from Eq. 47 the predictions $\hat{Y}^{(n)}$ are given by
$\hat{Y}^{(n)} = X_2^{(n)} B^{(n)}$. Hence, $\hat{Y}^{(n)}$ are linear combinations
of the normal random variables $B^{(n)}$ and are therefore normally
distributed. The mean vector and covariance matrix are calculated below.

For further details see p. 386 in Ref. 7.

$$E\,\hat{Y}^{(n)} = X_2^{(n)} E\,B^{(n)} = X_2^{(n)}\beta \quad \text{(from Eq. 51)}$$
$$\phantom{E\,\hat{Y}^{(n)}} = E\,Y_2^{(n)} \quad \text{(from Eqs. 43, 44, and 45)}$$

$$M_{\hat{Y}^{(n)}} = X_2^{(n)} M_{B^{(n)}} {X_2^{(n)}}' = \sigma^2 X_2^{(n)} {S^{(n)}}^{-1} {X_2^{(n)}}' \quad \text{(from Eq. 51)}$$

Hence, we have that

$$\hat{Y}^{(n)} \overset{d}{=} N\!\left(E\,\hat{Y}^{(n)}, M_{\hat{Y}^{(n)}}\right) \qquad (53)$$

where

$$E\,\hat{Y}^{(n)} = X_2^{(n)}\beta = E\,Y_2^{(n)}$$

and

$$M_{\hat{Y}^{(n)}} = \sigma^2 X_2^{(n)} {S^{(n)}}^{-1} {X_2^{(n)}}'$$

Finally, the distribution of the residuals $D^{(n)}$ can now be found.
From Eq. 48 it will be recalled that

$$D^{(n)} = \hat{Y}^{(n)} - Y_2^{(n)}$$

which is the difference of two normally distributed random vectors.
Hence, $D^{(n)}$ is normal. The mean vector and covariance matrix are
calculated below.

$$E\,D^{(n)} = E\left[\hat{Y}^{(n)} - Y_2^{(n)}\right] = E\,\hat{Y}^{(n)} - E\,Y_2^{(n)} = 0 \quad \text{(from Eq. 53)}$$

and

$$M_{D^{(n)}} = M_{\hat{Y}^{(n)}} + M_{Y_2^{(n)}} - M_{\hat{Y}^{(n)},Y_2^{(n)}} - M_{Y_2^{(n)},\hat{Y}^{(n)}} \qquad (54)$$

Now

$$\hat{Y}^{(n)} = X_2^{(n)} {S^{(n)}}^{-1} {X_1^{(n)}}' Y_1^{(n)} \quad \text{(from Eqs. 47 and 50)}$$

Hence

$$M_{\hat{Y}^{(n)},Y_2^{(n)}} = X_2^{(n)} {S^{(n)}}^{-1} {X_1^{(n)}}' M_{Y_1^{(n)},Y_2^{(n)}}$$

But from Eq. 46 we have that

$$M_{Y_1^{(n)},Y_2^{(n)}} = 0$$

as $Y_1^{(n)}$ and $Y_2^{(n)}$ are independent. Hence,

$$M_{\hat{Y}^{(n)},Y_2^{(n)}} = 0$$

By similar reasoning, it can be shown that

$$M_{Y_2^{(n)},\hat{Y}^{(n)}} = 0$$

Hence we have from Eq. 54 that

$$M_{D^{(n)}} = M_{\hat{Y}^{(n)}} + M_{Y_2^{(n)}} = \sigma^2 X_2^{(n)} {S^{(n)}}^{-1} {X_2^{(n)}}' + \sigma^2 I_{N-n} \quad \text{(from Eqs. 46 and 53)}$$

To summarize, then, we have that

$$D^{(n)} \overset{d}{=} N\!\left(0, M_{D^{(n)}}\right) \qquad (55)$$

where

$$M_{D^{(n)}} = M_{\hat{Y}^{(n)}} + M_{Y_2^{(n)}} = \sigma^2\left[X_2^{(n)} {S^{(n)}}^{-1} {X_2^{(n)}}' + I_{N-n}\right]$$
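As a check on these formulas, it may help to work the simplest special case. The example below is an illustration added here, not part of the original derivation; it assumes a model with a constant term only (P = 0), so that every row of $X$ is the scalar 1, and it writes $\mathbf{1}_m$ for a column of $m$ ones and $\bar{y}_n$ for the mean of the first $n$ observations.

```latex
% Constant-only model (assumption: P = 0), sample size n.
X_1^{(n)} = \mathbf{1}_n, \qquad X_2^{(n)} = \mathbf{1}_{N-n}, \qquad
S^{(n)} = \mathbf{1}_n' \mathbf{1}_n = n, \qquad
B^{(n)} = \tfrac{1}{n}\,\mathbf{1}_n' Y_1^{(n)} = \bar{y}_n

% Prediction covariance (Eq. 53): every prediction equals \bar{y}_n.
M_{\hat{Y}^{(n)}} = \frac{\sigma^2}{n}\,\mathbf{1}_{N-n}\mathbf{1}_{N-n}'

% Residual covariance (Eq. 55).
M_{D^{(n)}} = \sigma^2\left[\frac{1}{n}\,\mathbf{1}_{N-n}\mathbf{1}_{N-n}' + I_{N-n}\right]

% Entry by entry:
\operatorname{VAR}\, d_{n+k}^{(n)} = \sigma^2\left(1 + \frac{1}{n}\right), \qquad
\operatorname{COV}\left(d_{n+k}^{(n)},\, d_{n+j}^{(n)}\right) = \frac{\sigma^2}{n}
\quad (k \neq j)
```

In this case the residuals from a single subsample are positively correlated because they share the same estimate $\bar{y}_n$, and the correlation shrinks as $n$ grows.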


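D. DISTRIBUTION OF PREDICTIONS AND RESIDUALS FROM TWO SUBSAMPLES

So far we have been addressing the distribution of the predictions
and the residuals from one subsample. The problem now is to find the
joint distribution of

$$\hat{Y}^{(n_1)}, \hat{Y}^{(n_2)}$$

and the joint distribution of

$$D^{(n_1)}, D^{(n_2)}$$

where $n_o \le n_1 < n_2 < N$. That they are normal has been determined in
Sec. C, as have their expected values. In addition, we have
determined part of their covariance matrices in Sec. C, namely

$$M_{\hat{Y}^{(n_1)}} \text{ and } M_{\hat{Y}^{(n_2)}} \quad \text{(from Eq. 53)}$$

and

$$M_{D^{(n_1)}} \text{ and } M_{D^{(n_2)}} \quad \text{(from Eq. 55)}$$

Still to be determined are $M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}}$ and $M_{D^{(n_1)},D^{(n_2)}}$ (and their
transposes).

At this point it is necessary to introduce some additional notation.
Let the $X$ matrix and the $Y$ vector be partitioned as follows:

$$X = \begin{bmatrix} X_1^{(n_1)} \\ \tilde{X} \\ X_2^{(n_2)} \end{bmatrix} \begin{matrix} n_1 \text{ rows} \\ n_2-n_1 \text{ rows} \\ N-n_2 \text{ rows} \end{matrix} \qquad (56)$$

and

$$Y = \begin{bmatrix} Y_1^{(n_1)} \\ \tilde{Y} \\ Y_2^{(n_2)} \end{bmatrix} \begin{matrix} n_1 \text{ entries} \\ n_2-n_1 \text{ entries} \\ N-n_2 \text{ entries} \end{matrix}$$

The relationship of this new partition to the partition previously
used is as follows:

$$X_1^{(n_2)} = \begin{bmatrix} X_1^{(n_1)} \\ \tilde{X} \end{bmatrix}, \quad X_2^{(n_1)} = \begin{bmatrix} \tilde{X} \\ X_2^{(n_2)} \end{bmatrix}, \quad Y_1^{(n_2)} = \begin{bmatrix} Y_1^{(n_1)} \\ \tilde{Y} \end{bmatrix}, \quad Y_2^{(n_1)} = \begin{bmatrix} \tilde{Y} \\ Y_2^{(n_2)} \end{bmatrix} \qquad (57)$$

Hence $\tilde{X}$ and $\tilde{Y}$ are the regions of shift in the partition when
the Historical Simulation procedure passes from a subsample of size $n_1$
to one of size $n_2$. $\tilde{X}$ is not used in the fit routine when the
subsample size is $n_1$, but it is used when the subsample size is $n_2$.
Similarly, predictions are made of $\tilde{Y}$ when the subsample size is $n_1$,
but they are not made when the subsample size is $n_2$.

By analogous reasoning to that used to establish Eq. 46 and the
mutual independence of $Y_1^{(n_1)}$, $\tilde{Y}$, $Y_2^{(n_2)}$, which is a consequence of
Eq. 43, it can be shown that

$$M_{Y_1^{(n_1)}} = \sigma^2 I_{n_1}, \qquad M_{\tilde{Y}} = \sigma^2 I_{n_2-n_1}, \qquad M_{Y_2^{(n_2)}} = \sigma^2 I_{N-n_2}$$

and

$$M_{Y_1^{(n_1)},\tilde{Y}}, \quad M_{Y_1^{(n_1)},Y_2^{(n_2)}}, \quad M_{\tilde{Y},Y_2^{(n_2)}} \qquad (58)$$

[and their transposes] all equal zero.

We now turn back to the problem of calculating
$M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}}$ and $M_{D^{(n_1)},D^{(n_2)}}$. These are calculated below.

From (47) and (50) we have that

$$\hat{Y}^{(n_1)} = X_2^{(n_1)} {S^{(n_1)}}^{-1} {X_1^{(n_1)}}' Y_1^{(n_1)}$$

and

$$\hat{Y}^{(n_2)} = X_2^{(n_2)} {S^{(n_2)}}^{-1} {X_1^{(n_2)}}' Y_1^{(n_2)}$$

Hence

$$M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}} = X_2^{(n_1)} {S^{(n_1)}}^{-1} {X_1^{(n_1)}}' \, M_{Y_1^{(n_1)},Y_1^{(n_2)}} \, X_1^{(n_2)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$

But

$$Y_1^{(n_2)} = \begin{bmatrix} Y_1^{(n_1)} \\ \tilde{Y} \end{bmatrix} \quad \text{(from Eq. 57)}$$

Hence

$$M_{Y_1^{(n_1)},Y_1^{(n_2)}} = \begin{bmatrix} M_{Y_1^{(n_1)}} & M_{Y_1^{(n_1)},\tilde{Y}} \end{bmatrix} = \begin{bmatrix} \sigma^2 I_{n_1} & 0 \end{bmatrix} \quad \text{(by Eq. 58)}$$

Hence we have

$$M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}} = X_2^{(n_1)} {S^{(n_1)}}^{-1} {X_1^{(n_1)}}' \begin{bmatrix} \sigma^2 I_{n_1} & 0 \end{bmatrix} X_1^{(n_2)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$

But

$$X_1^{(n_2)} = \begin{bmatrix} X_1^{(n_1)} \\ \tilde{X} \end{bmatrix} \quad \text{(from Eq. 57)}$$

Hence

$$\begin{bmatrix} \sigma^2 I_{n_1} & 0 \end{bmatrix} X_1^{(n_2)} = \sigma^2 I_{n_1} X_1^{(n_1)} + 0 = \sigma^2 X_1^{(n_1)}$$

By substitution we then have

$$M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}} = \sigma^2 X_2^{(n_1)} {S^{(n_1)}}^{-1} {X_1^{(n_1)}}' X_1^{(n_1)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$
$$= \sigma^2 X_2^{(n_1)} {S^{(n_1)}}^{-1} S^{(n_1)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \quad \text{(from Eq. 49)}$$
$$= \sigma^2 X_2^{(n_1)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \quad \text{(from Eq. 52)}$$
$$= \sigma^2 \begin{bmatrix} \tilde{X} \\ X_2^{(n_2)} \end{bmatrix} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \quad \text{(from Eq. 57)}$$

But from Eq. 53 we have

$$M_{\hat{Y}^{(n_2)}} = \sigma^2 X_2^{(n_2)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$

Hence

$$M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}} = \begin{bmatrix} \sigma^2 \tilde{X} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \\ M_{\hat{Y}^{(n_2)}} \end{bmatrix} \qquad (59)$$

The joint distribution of $\hat{Y}^{(n_1)}, \hat{Y}^{(n_2)}$ can now be summarized from
Eqs. 53 and 59 as follows:

$$\begin{bmatrix} \hat{Y}^{(n_1)} \\ \hat{Y}^{(n_2)} \end{bmatrix} \overset{d}{=} N\!\left( \begin{bmatrix} E\,Y_2^{(n_1)} \\ E\,Y_2^{(n_2)} \end{bmatrix},\; M_{\begin{bmatrix} \hat{Y}^{(n_1)} \\ \hat{Y}^{(n_2)} \end{bmatrix}} \right) \qquad (60)$$

where

$$M_{\begin{bmatrix} \hat{Y}^{(n_1)} \\ \hat{Y}^{(n_2)} \end{bmatrix}} = \begin{bmatrix} M_{\hat{Y}^{(n_1)}} & \begin{matrix} A \\ M_{\hat{Y}^{(n_2)}} \end{matrix} \\ \begin{matrix} A' & M_{\hat{Y}^{(n_2)}} \end{matrix} & M_{\hat{Y}^{(n_2)}} \end{bmatrix}$$

The first block column has $N-n_1$ columns and the second has $N-n_2$ columns;
within the first, $A'$ occupies $n_2-n_1$ columns and $M_{\hat{Y}^{(n_2)}}$ occupies $N-n_2$
columns. The row blocks have $n_2-n_1$, $N-n_2$, and $N-n_2$ rows, respectively. Here

$$A = \sigma^2 \tilde{X} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$

and the other entries are defined by Eq. 53.

It is rather interesting to note that the covariances defined
in (59) are not dependent on $S^{(n_1)}$.
The implication of this is that the covariances of predictions made
from different subsample sizes are dependent only on the data matrix
used in the larger subsample size fit. In particular

$$\text{COV}\left[\hat{y}_{n_1+k}^{(n_1)},\, \hat{y}_{n_2+j}^{(n_2)}\right] = \text{COV}\left[\hat{y}_{n_1+k}^{(n_2)},\, \hat{y}_{n_2+j}^{(n_2)}\right] \qquad (61)$$

for $n_1+k > n_2$; i.e., $y_{n_1+k}$ was predicted from both subsamples.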


Turning to the last task,

$$M_{D^{(n_1)},D^{(n_2)}}$$

must be calculated. Throughout, $\tilde{X}$ and $\tilde{Y}$ denote the middle blocks
(rows $n_1+1$ through $n_2$) of the partitions of $X$ and $Y$ in Eq. 56.
From Eq. 48 we have that

$$D^{(n_1)} = \hat{Y}^{(n_1)} - Y_2^{(n_1)}$$

and

$$D^{(n_2)} = \hat{Y}^{(n_2)} - Y_2^{(n_2)}$$

Hence

$$M_{D^{(n_1)},D^{(n_2)}} = M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}} + M_{Y_2^{(n_1)},Y_2^{(n_2)}} - M_{Y_2^{(n_1)},\hat{Y}^{(n_2)}} - M_{\hat{Y}^{(n_1)},Y_2^{(n_2)}}$$

Taking this expression term by term we have that

$$M_{\hat{Y}^{(n_1)},\hat{Y}^{(n_2)}} = \begin{bmatrix} \sigma^2 \tilde{X} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \\ M_{\hat{Y}^{(n_2)}} \end{bmatrix} \quad \text{(by Eq. 59)}$$

For the second term,

$$M_{Y_2^{(n_1)},Y_2^{(n_2)}} = \begin{bmatrix} M_{\tilde{Y},Y_2^{(n_2)}} \\ M_{Y_2^{(n_2)}} \end{bmatrix} \quad \text{(from Eq. 57)}$$
$$= \begin{bmatrix} 0 \\ \sigma^2 I_{N-n_2} \end{bmatrix} \quad \text{(from Eq. 58)}$$

For the third term, from Eqs. 47 and 50 we have that

$$\hat{Y}^{(n_2)} = X_2^{(n_2)} {S^{(n_2)}}^{-1} {X_1^{(n_2)}}' Y_1^{(n_2)}$$

Hence

$$M_{Y_2^{(n_1)},\hat{Y}^{(n_2)}} = M_{Y_2^{(n_1)},Y_1^{(n_2)}} \, X_1^{(n_2)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$

But by Eq. 57

$$Y_2^{(n_1)} = \begin{bmatrix} \tilde{Y} \\ Y_2^{(n_2)} \end{bmatrix} \quad \text{and} \quad Y_1^{(n_2)} = \begin{bmatrix} Y_1^{(n_1)} \\ \tilde{Y} \end{bmatrix}$$

Hence

$$M_{Y_2^{(n_1)},Y_1^{(n_2)}} = \begin{bmatrix} M_{\tilde{Y},Y_1^{(n_1)}} & M_{\tilde{Y}} \\ M_{Y_2^{(n_2)},Y_1^{(n_1)}} & M_{Y_2^{(n_2)},\tilde{Y}} \end{bmatrix} = \begin{bmatrix} 0 & \sigma^2 I_{n_2-n_1} \\ 0 & 0 \end{bmatrix} \quad \text{(by Eq. 58)}$$

Hence

$$M_{Y_2^{(n_1)},\hat{Y}^{(n_2)}} = \begin{bmatrix} 0 & \sigma^2 I_{n_2-n_1} \\ 0 & 0 \end{bmatrix} X_1^{(n_2)} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$

But

$$X_1^{(n_2)} = \begin{bmatrix} X_1^{(n_1)} \\ \tilde{X} \end{bmatrix} \quad \text{(by Eq. 57)}$$

Hence

$$M_{Y_2^{(n_1)},\hat{Y}^{(n_2)}} = \begin{bmatrix} 0 & \sigma^2 I_{n_2-n_1} \\ 0 & 0 \end{bmatrix} \begin{bmatrix} X_1^{(n_1)} \\ \tilde{X} \end{bmatrix} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}'$$

Performing the multiplication we then have

$$M_{Y_2^{(n_1)},\hat{Y}^{(n_2)}} = \begin{bmatrix} \sigma^2 \tilde{X} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \\ 0 \end{bmatrix}$$

The final term is

$$M_{\hat{Y}^{(n_1)},Y_2^{(n_2)}}$$

From Eqs. 47 and 50 we have that

$$\hat{Y}^{(n_1)} = X_2^{(n_1)} {S^{(n_1)}}^{-1} {X_1^{(n_1)}}' Y_1^{(n_1)}$$

Hence

$$M_{\hat{Y}^{(n_1)},Y_2^{(n_2)}} = X_2^{(n_1)} {S^{(n_1)}}^{-1} {X_1^{(n_1)}}' M_{Y_1^{(n_1)},Y_2^{(n_2)}}$$

and

$$M_{Y_1^{(n_1)},Y_2^{(n_2)}} = 0 \quad \text{(from Eq. 58)}$$

Hence

$$M_{\hat{Y}^{(n_1)},Y_2^{(n_2)}} = 0$$

Collecting terms we therefore have that

$$M_{D^{(n_1)},D^{(n_2)}} = \begin{bmatrix} \sigma^2 \tilde{X} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \\ M_{\hat{Y}^{(n_2)}} \end{bmatrix} + \begin{bmatrix} 0 \\ \sigma^2 I_{N-n_2} \end{bmatrix} - \begin{bmatrix} \sigma^2 \tilde{X} {S^{(n_2)}}^{-1} {X_2^{(n_2)}}' \\ 0 \end{bmatrix} - 0$$

Hence

$$M_{D^{(n_1)},D^{(n_2)}} = \begin{bmatrix} 0 \\ M_{\hat{Y}^{(n_2)}} + \sigma^2 I_{N-n_2} \end{bmatrix}$$

and by Eq. 55 we have

$$M_{D^{(n_1)},D^{(n_2)}} = \begin{bmatrix} 0 \\ M_{D^{(n_2)}} \end{bmatrix} \qquad (62)$$

This result has some rather interesting consequences. As in the
case of the predictions, covariance between residuals depends only on
the information used in the fit performed on the larger of the two
subsamples used to generate the residuals. Hence, we have an analogous
result to Eq. 61, namely

$$\text{COV}\left[d_{n_1+k}^{(n_1)},\, d_{n_2+j}^{(n_2)}\right] = \text{COV}\left[d_{n_1+k}^{(n_2)},\, d_{n_2+j}^{(n_2)}\right] \qquad (63)$$

for $n_1+k > n_2$; i.e., residual calculations were made for $y_{n_1+k}$ from
both subsamples.

An even more interesting consequence is that the covariance
between residuals, one of which is not calculated for both subsamples,
is zero. Hence

$$\text{COV}\left[d_{n_1+k}^{(n_1)},\, d_{n_2+j}^{(n_2)}\right] = 0 \quad \text{if } n_1+k \le n_2 \qquad (64)$$

In particular, the one-step residuals

$$d_{n_o+1}^{(n_o)},\; d_{n_o+2}^{(n_o+1)},\; \ldots,\; d_{N}^{(N-1)}$$

have zero covariances. This fact, coupled with a normal distribution,
implies independence. Hence, the one-step residuals are mutually
independent.


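The proof of this assertion is not too difficult. Consider two normally
distributed random variables $U$ and $V$. Let

$$U \overset{d}{=} N\left[E(U), \sigma_U^2\right], \qquad V \overset{d}{=} N\left[E(V), \sigma_V^2\right], \qquad \text{COV}(U,V) = 0$$

Then

$$f_U(u) = \frac{1}{\sqrt{2\pi}\,\sigma_U} \exp\left[-\frac{(u - EU)^2}{2\sigma_U^2}\right]$$

$$f_V(v) = \frac{1}{\sqrt{2\pi}\,\sigma_V} \exp\left[-\frac{(v - EV)^2}{2\sigma_V^2}\right]$$

$$f_{U,V}(u,v) = \frac{1}{2\pi\sigma_U\sigma_V} \exp\left\{-\left[\frac{(u - EU)^2}{2\sigma_U^2} + \frac{(v - EV)^2}{2\sigma_V^2}\right]\right\}$$

where $f_U$ and $f_V$ are the density functions of $U$ and $V$ and $f_{U,V}$
is the joint density function of $U$ and $V$.

Now, according to Ref. 7, $U$ and $V$ are independent if

$$f_{U,V} = (f_U)(f_V)$$

This is clearly the case for the densities defined above.

The joint distribution of $D^{(n_1)}$ and $D^{(n_2)}$ can now be summarized
using Eqs. 55 and 62. We have that

$$\begin{bmatrix} D^{(n_1)} \\ D^{(n_2)} \end{bmatrix} \overset{d}{=} N\!\left(0,\; M_{\begin{bmatrix} D^{(n_1)} \\ D^{(n_2)} \end{bmatrix}}\right)$$

where

$$M_{\begin{bmatrix} D^{(n_1)} \\ D^{(n_2)} \end{bmatrix}} = \begin{bmatrix} M_{D^{(n_1)}} & \begin{matrix} 0 \\ M_{D^{(n_2)}} \end{matrix} \\ \begin{matrix} 0 & M_{D^{(n_2)}} \end{matrix} & M_{D^{(n_2)}} \end{bmatrix}$$

The first block column has $N-n_1$ columns and the second has $N-n_2$ columns;
in the lower left block the zero matrix occupies $n_2-n_1$ columns and
$M_{D^{(n_2)}}$ occupies $N-n_2$ columns. The row blocks have $n_2-n_1$, $N-n_2$,
and $N-n_2$ rows, respectively.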
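The zero-covariance results of this section can be checked numerically without any random sampling. The sketch below is an illustration under stated assumptions rather than part of the report's program: it assumes a single independent variable (P = 1), and the helper names are hypothetical. It relies on the fact that each residual is a linear function of $y_1, \ldots, y_N$, so its coefficient vector can be recovered by feeding unit vectors through the fit; under Eq. 43 the covariance of two residuals is then $\sigma^2$ times the dot product of their coefficient vectors.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1 * x (P = 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual(xs, ys, n, t):
    """d_t^(n): prediction of point t (1-based) from the first n points,
    minus the actual value (Eq. 48)."""
    b0, b1 = fit_line(xs[:n], ys[:n])
    return b0 + b1 * xs[t - 1] - ys[t - 1]


def coefficients(xs, n, t):
    """Weights c such that d_t^(n) = sum(c[i] * y[i]); obtained by
    evaluating the (linear) residual on each unit vector."""
    size = len(xs)
    weights = []
    for i in range(size):
        unit = [0.0] * size
        unit[i] = 1.0
        weights.append(residual(xs, unit, n, t))
    return weights


def cov_over_sigma2(xs, n1, t1, n2, t2):
    """COV[d_t1^(n1), d_t2^(n2)] divided by sigma^2, for independent
    errors with common variance."""
    c1 = coefficients(xs, n1, t1)
    c2 = coefficients(xs, n2, t2)
    return sum(a * b for a, b in zip(c1, c2))
```

For example, with seven time-ordered points, `cov_over_sigma2(xs, 3, 4, 5, 6)` compares two one-step residuals and comes out as zero up to rounding, while `cov_over_sigma2(xs, 3, 7, 5, 6)` agrees with `cov_over_sigma2(xs, 5, 7, 5, 6)`, as Eq. 63 states.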


E. SUMMARY

With the help of some additional notation, it is possible to
summarize and simplify the results of the preceding sections and combine
all the information about the form and distribution of the predictions
and residuals into one table. Recall from Eq. 41 that $x_i'$ is a row
vector equal to the $i$th row of the $X$ matrix. Now, define a set of
constants by

$$C_{ij}^{(n)} = x_i' \, {S^{(n)}}^{-1} x_j$$

where $x_j$ is the column vector corresponding to the $j$th column of $X'$
(i.e., the transpose of $x_j'$).

Then by examining the results of sections B, C, and D of this
appendix, one can obtain the form and distribution of the Predictions
and Residuals given in Table 25. Reference numbers of the equations
and facts derived in sections B, C, and D are given in parentheses.

Some of the conclusions that can be drawn from this table are
given below.

1. The residual covariances are very similar to the prediction
covariances. In fact, they are equal unless one of the points
being predicted is not predicted from both subsamples, or the
point being predicted is the same for both subsamples. In the
first of these exceptions the residual covariance is zero.
In the second exception, the calculation is similar to a
variance calculation. The residual variance is obtained by adding
$\sigma^2$ to the prediction variance. In like manner, the residual
covariance is obtained by adding $\sigma^2$ to the prediction covariance.

2. The covariances depend on the two subsample sizes $m$ and $n$
only insofar as which S-matrix to use. If $m \le n$ then $S^{(n)}$ is
used. If $m > n$ then $S^{(m)}$ is used. This is the only difference
between the coefficients $C_{n+k,\,m+j}^{(m)}$ and $C_{n+k,\,m+j}^{(n)}$. The rule to follow is
always use the S-matrix corresponding to the larger sample size.

3. In general the predictions and residuals are correlated
(among themselves). However,

$$\text{COV}\left[d_{n+k}^{(n)},\, d_{m+j}^{(m)}\right] = 0$$

if $n+k \le m$ or $m+j \le n$. This means in particular that the
one-step residuals $d_{n+1}^{(n)}$, $n_o \le n < N$, are uncorrelated. As discussed
in section D, this zero covariance, together with a normal
distribution, implies independence.

TABLE 25

FORM AND DISTRIBUTION OF PREDICTIONS AND RESIDUALS
(Assuming the usual multiple linear regression model assumptions)

APPENDIX III

PROPERTIES OF VARIANCE ESTIMATORS

In this appendix, as in Appendix II, the multiple linear regression
model will be assumed. This model has been described in Appendix II,
Eqs. 35 through 43.

In Sec. IV C of the body of the report, modified residuals were
derived which are theoretically a random sample from a normal population
with mean 0 and variance $\sigma^2$, the variance of the error terms in
the regression model, Eq. 40. These modified residuals are denoted by
$r = (r_{n_o}, r_{n_o+1}, \ldots, r_{N-1})$, where $n_o$ is the minimum sample size
for Historical Simulation and $N$ is the size of the data base.

A Kolmogorov-Smirnov (K-S) Goodness-of-Fit Test has been suggested
in Sec. IV C. In order to apply this test, however, an estimate of the
variance $\sigma^2$ must be made. Three possible candidates have been considered.
In this appendix these candidates are described; distributions are derived
for the case when the multiple linear regression model holds (which is
the null hypothesis in the K-S test), and relative efficiencies are
discussed. A selection is then made for use in the K-S test.

Two estimates of the variance can be calculated directly from the
output $r$. These are the sample variance $S_r^2$ and the zero-mean sample
variance $\tilde{\sigma}^2$. The respective equations are given by

$$S_r^2 = \frac{1}{N - (n_o+1)} \sum_{i=n_o}^{N-1} (r_i - \bar{r})^2 \qquad (65)$$

and

$$\tilde{\sigma}^2 = \frac{1}{N - n_o} \sum_{i=n_o}^{N-1} r_i^2 \qquad (66)$$

The only difference between the two equations is that $S_r^2$ is calculated
around the sample mean,

$$\bar{r} = \frac{1}{N - n_o} \sum_{i=n_o}^{N-1} r_i$$

while $\tilde{\sigma}^2$ assumes the mean is zero.

It is a known fact (Ref. 7, pages 315-316) that

$$\frac{[N - (n_o+1)]\,S_r^2}{\sigma^2} \qquad (67)$$

has a chi-square distribution with $N - (n_o+1)$ degrees of freedom, while

$$\frac{(N - n_o)\,\tilde{\sigma}^2}{\sigma^2}$$

has a chi-square distribution with $N - n_o$ degrees of freedom. (One
degree of freedom is lost by $S_r^2$ because of the use of $\bar{r}$ in its
calculation.)

Now the chi-square distribution with $K$ degrees of freedom has an
expected value of $K$ and a variance equal to $2K$. Hence the expected
value E and variance of $\tilde{\sigma}^2$ can be derived as follows:

$$\frac{(N-n_o)}{\sigma^2}\, E\,\tilde{\sigma}^2 = E\left(\frac{(N-n_o)\,\tilde{\sigma}^2}{\sigma^2}\right) = N - n_o$$

and

$$\frac{(N-n_o)^2}{\sigma^4}\, \text{VAR}\,\tilde{\sigma}^2 = \text{VAR}\left(\frac{(N-n_o)\,\tilde{\sigma}^2}{\sigma^2}\right) = 2(N-n_o)$$

Therefore

$$E\,\tilde{\sigma}^2 = \sigma^2$$

and

$$\text{VAR}\,\tilde{\sigma}^2 = \frac{2\sigma^4}{N - n_o}$$

Similarly it can be shown that

$$E\,S_r^2 = \sigma^2 \qquad (68)$$

and

$$\text{VAR}\,S_r^2 = \frac{2\sigma^4}{N - (n_o+1)} \qquad (69)$$
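The two estimators computed directly from the modified residuals can be sketched as follows. This is an illustrative fragment, not code from the report; the function name `variance_estimates` is hypothetical, and the list `r` is assumed to hold the $N - n_o$ modified residuals $r_{n_o}, \ldots, r_{N-1}$.

```python
def variance_estimates(r):
    """Return (S_r^2, sigma_tilde^2) for a list r of N - n_o modified residuals.

    S_r^2 is taken around the sample mean and divides by (N - n_o) - 1,
    as in Eq. 65; sigma_tilde^2 assumes a zero mean and divides by
    N - n_o, as in Eq. 66.
    """
    m = len(r)                      # m = N - n_o
    mean_r = sum(r) / m
    s2_r = sum((ri - mean_r) ** 2 for ri in r) / (m - 1)
    sigma2_tilde = sum(ri ** 2 for ri in r) / m
    return s2_r, sigma2_tilde
```

When the residuals happen to average exactly zero, the two sums of squares coincide and the estimates differ only through the divisor.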


The final candidate for a variance estimator is $\hat{\sigma}^2$, the square of
the standard error of the estimate obtained from a regression analysis on
the entire data base (i.e., sample size $N$). The equation for $\hat{\sigma}^2$ is
given by

$$\hat{\sigma}^2 = \frac{1}{N - (P+1)} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad (70)$$

where $y_i$ is the actual cost of the $i$th procurement,

$\hat{y}_i$ is the estimated cost of the $i$th procurement (obtained
from a regression analysis of the entire sample),

and $P$ is the number of independent variables in the linear model
(PER).

It is known under the regression model assumptions (Ref. 7, page
364) that

$$\frac{[N - (P+1)]\,\hat{\sigma}^2}{\sigma^2}$$

has a chi-square distribution with $N - (P+1)$ degrees of freedom.
Following the derivations of Eqs. 68 and 69 we then have

$$E\,\hat{\sigma}^2 = \sigma^2$$

and

$$\text{VAR}\,\hat{\sigma}^2 = \frac{2\sigma^4}{N - (P+1)} \qquad (71)$$

The facts obtained to this point are summarized in Table 26. We
now address the question of which of these estimators is best to use in
the K-S test.

As can be seen from the table, the candidate estimators are all
unbiased, i.e., their expected value is $\sigma^2$, the quantity that is being
estimated. In addition, they are all consistent since the variance
converges to zero as the sample size $N$ gets large.

TABLE 26

CANDIDATE VARIANCE ESTIMATORS

| Notation | Description | Equation | Expected Value | Variance |
|---|---|---|---|---|
| $S_r^2$ | Sample variance of r | 64 | $\sigma^2$ | $\dfrac{2\sigma^4}{N-(n_0+1)}$ |
| $\tilde{\sigma}^2$ | Zero mean sample variance of r | 65 | $\sigma^2$ | $\dfrac{2\sigma^4}{N-n_0}$ |
| $\hat{\sigma}^2$ | Square of standard error of the estimate obtained from regression analysis of entire sample | 70 | $\sigma^2$ | $\dfrac{2\sigma^4}{N-(P+1)}$ |

Differences in the estimators can, however, be seen when their
relative efficiencies are examined. According to Ref. 7, page 216, if
the estimators are unbiased, then the one with the smallest variance is
more efficient. In Sec. III C, it was pointed out that the minimum
sample size for Historical Simulation must be larger than the number of
parameters to be estimated. In the case being considered, P+1 parameters
are to be estimated, one for each independent variable and one for the
constant term. Hence $n_0 > P+1$. This implies that for any sample size
N, $\mathrm{VAR}\,\hat{\sigma}^2 < \mathrm{VAR}\,\tilde{\sigma}^2 < \mathrm{VAR}\,S_r^2$. Hence, $\hat{\sigma}^2$ is the most efficient of
the three candidates.
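The ordering of the three variances can be checked numerically. The sketch below evaluates the variance formulas from Table 26 for illustrative values (N = 15, n_0 = 9, P = 2, and a variance of 1), which are assumptions made for the example rather than figures from the report; `estimator_variances` is a hypothetical helper name.

```python
def estimator_variances(sigma2, n_total, n0, p):
    """Theoretical variances of the three candidate estimators (Table 26)."""
    if not (n0 > p + 1 and n_total > n0 + 1):
        raise ValueError("need n0 > P + 1 and N > n0 + 1")
    two_sigma4 = 2.0 * sigma2 ** 2
    return {
        "S_r^2": two_sigma4 / (n_total - (n0 + 1)),
        "sigma_tilde^2": two_sigma4 / (n_total - n0),
        "sigma_hat^2": two_sigma4 / (n_total - (p + 1)),
    }


# Illustrative inputs only (assumptions, not data from the report).
v = estimator_variances(sigma2=1.0, n_total=15, n0=9, p=2)
for name, value in v.items():
    print(f"VAR {name} = {value:.4f}")
```

With these inputs the divisors are 5, 6, and 12, so the regression-based estimator has the smallest variance of the three, as the inequality states.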

Using efficiency as the criterion, $\hat{\sigma}^2$ would then be selected as
the estimator of the variance $\sigma^2$. It will be noted, however, that as
N gets large, the differences in the variances of the estimators get
small; for example,

$$\frac{\mathrm{VAR}\,\tilde{\sigma}^2}{\mathrm{VAR}\,\hat{\sigma}^2} = \frac{N-(P+1)}{N-n_0}$$

converges to 1 as N gets large. Hence, the advantage in efficiency
for $\hat{\sigma}^2$ is only significant for small N. This is of course precisely
the situation usually faced by the cost analyst.
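How quickly the efficiency advantage fades can be seen by tabulating the ratio [N - (P+1)] / (N - n_0) for growing N. The values n_0 = 9 and P = 2 used below are illustrative assumptions, and `variance_ratio` is a hypothetical helper name.

```python
from fractions import Fraction


def variance_ratio(n_total, n0, p):
    """Ratio of the zero-mean estimator's variance to the regression
    estimator's variance: [N - (P + 1)] / (N - n0)."""
    if n_total <= n0:
        raise ValueError("need N > n0")
    return Fraction(n_total - (p + 1), n_total - n0)


# Illustrative sample sizes only (assumptions, not data from the report).
ratios = {n: variance_ratio(n, n0=9, p=2) for n in (15, 30, 100, 1000)}
for n, ratio in ratios.items():
    print(f"N = {n:4d}: ratio = {float(ratio):.4f}")
```

For N = 15 the ratio is exactly 2, so the zero-mean estimator has twice the variance of the regression-based one; by N = 1000 the ratio is within one percent of 1.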


Another advantage for choosing $\hat{\sigma}^2$ over $\tilde{\sigma}^2$ or $S_r^2$ is that $\hat{\sigma}^2$
is a function of the fit residuals from a regression analysis on the
entire sample, rather than the prediction residuals from Historical
Simulation. Hence it is more independent (in the non-statistical sense)
of the prediction residuals than the other estimators, since $\hat{\sigma}^2$ does not
depend directly on the values in the Historical Simulation residual
vector r. Therefore, $\hat{\sigma}^2$ more closely represents the given value (as
compared to an estimate) of $\sigma^2$ that is called for in the K-S test.


For the reasons discussed above, $\hat{\sigma}^2$ has been selected as the
estimate of $\sigma^2$ for the K-S test.
A recurring problem faced by many analysts is that of devising estimating
procedures for predicting some aspect of the future from rather meager data. This is
particularly true for the cost analyst who is concerned with estimating the
resource requirements of future military systems.

Historical Simulation is a method of evaluating candidate (cost) estimating
procedures on the basis of their ability to simulate predictions using data that
would have been available. In this fashion, Historical Simulation avoids relying
on the central evaluation assumption of Regression Theory, namely, that which fits
the past data best will predict the future best. This conceptual difference gives
Historical Simulation several unique features.

The report is divided into two volumes. The first volume, which is unclassified,
completely describes the technique and unique features. Volume two, classified
confidential (privileged information), illustrates the use of Historical
Simulation by describing the results of applying the technique to cost and man-hour
estimating procedures for selected aircraft programs. A non-technical overview
of the contents of these documents can be found in General Research Corporation
1 MR- 9 50. 